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I.  INTRODUCTION 

A.  BACKGROUND 

Organizations  have  over  time  developed  many  disparate  databases  to 
manage  information.  These  databases  have  been  implemented  using  a  wide 
variety  of  incompatible  models,  languages  and  storage  methods.  Migration  of 
database  systems  to  an  integrated  strategic  Information  Resource  based 
architecture  will  require  the  interoperability  of  these  diverse  sources  of  data. 
Conflicts  among  these  heterogeneous  databases  will  impede  consolidation 
efforts. 

This  problem  can  be  addressed  in  the  near  term  by  transforming  the 
schemas  of  incompatible  data  organizations,  such  as  hierarchical  and  rela- 
tional, into  a  common  data  model  which  will  capture  all  information  con- 
tained in  the  original  databases  and  make  it  available  to  the  user  in  a  unified 
form.  The  Enhanced  Entity-Relationship  model,  which  is  both  semantically 
rich  and  conceptually  simple,  can  serve  as  an  integrating  model  for  combining 
the  data  from  different  databases. 

With  the  independent  databases  represented  in  equivalent  schemas,  a 
framework  for  the  identification,  classification,  and  resolution  of  semantic 
data  conflicts  can  be  developed.  The  integrated  global  schema  can  guide  the 
formulation  of  queries,  and  the  detailed  understanding  of  semantic  conflicts 
among  the  component  databases  resulting  from  the  re-engineering  process  be 
incorporated  in  the  design  of  a  global  controller  which  can  manage  the 
retrieval  of  information  from  a  federated  database  application.  Users 


requiring  access  to  data  from  several  disjoint  databases  can  then  process 
queries  against  the  reconciled  common  schema. 

In  the  longer  term,  data  element  standardization  efforts  may  obviate 
many  of  the  semantic  conflicts  addressed  by  this  thesis.  However,  different 
preferred  forms  of  organizing  corporate  information  will  remain  specific  to 
various  functional  domains.  Tools  for  the  integration  of  data  from  heteroge- 
neous databases  will  still  be  required. 

B.  OBJECTIVES 

The  objective  of  this  thesis  is  to  develop  a  framework  for  identifying, 
classifying,  and  resolving  semantic  conflicts  using  the  Enhanced  Entity- 
Relationship  model.  This  includes  transforming  heterogeneous  databases  into 
a  common  schema  for  comparison  and  identification  of  semantic  conflicts, 
illustrating  all  possible  forms  of  semantic  conflict,  both  at  the  schema  and  at 
the  data  level.  Using  real  world  examples,  the  classification  framework  will 
be  applied  to  diverse  database  applications  in  the  course  of  an  integration 
effort.  Finally,  this  analysis  will  suggest,  in  general  terms,  resolutions  to  the 
various  semantic  conflicts  identified  through  the  use  of  the  framework.  Inte- 
gration of  the  component  databases  into  a  global  schema  and  design  guide- 
lines for  the  implementation  of  a  global  controller  completes  the  objectives. 

C.  RESEARCH  QUESTIONS 

The  following  research  questions  are  addressed  in  this  thesis: 

(1)  What  qualities  are  needed  in  an  integrating  data  model  to  integrate 
data  from  multiple  sources? 

(2)  What  types  of  semantic  data  conflicts  arise  in  heterogeneous  data- 
bases, and  what  is  an  appropriate  framework  for  classifying 
semantic  conflicts? 


(3)  How  can  semantic  data  conflicts  best  be  identified  and  resolved  to 
allow  integrated  access  to  corporate  information  stored  in  databases 
using  different  data  models,  definitions,  and  constraints? 

(4)  How  might  semantic  conflicts  be  resolved  to  allow  the  formation  of  a 
common  global  schema  incorporating  heterogeneous  databases 
which  use  different  data  models,  definitions,  and  constraints,  and 
what  guidance  can  the  re-engineering  process  give  toward  the 
design  of  a  global  controller  component  for  a  federated  database 
application? 

D.  SCOPE  AND  LIMITATIONS 

This  thesis  will  review  the  potential  semantic  data  conflicts  which  can 
arise  in  heterogeneous  databases,  and  develop  a  framework  for  classifying 
these  conflicts.  A  common  data  model  for  use  in  integrating  diverse  sources  of 
data  is  examined  and  evaluated  for  appropriate  qualities.  General  measures 
to  resolve  these  conflicts  with  the  aim  of  integrating  the  data  and  useful  in 
the  design  of  a  functional  federated  database  controller  will  be  examined. 

This  thesis  will  not  address  heterogeneity  at  the  platform  or  Database 
Management  System  level.  Real  world  database  specifications  from  Depart- 
ment of  Defense  users  will  be  analyzed  at  the  level  of  descriptive  detail 
obtained.  Reasonable  assumptions  will  be  made  (in  terms  of  detailed  data 
definitions,  etc.),  where  necessary,  to  illustrate  the  types  of  semantic  conflict 
under  discussion. 

E.  METHODOLOGY 

The  methodologies  used  in  this  thesis  include: 

(1)  Obtaining  database  specifications  for  several  real  world  applications 
from  the  same  functional  domain  (i.e.,  administrative  personnel 
management). 

(2)  Examining  an  appropriate  common  integrating  conceptual  data 
model  for  integration  of  diverse  databases. 


(3)  Transforming  the  separate  databases  into  equivalent  schemas, 
using  the  conceptual  integrating  data  model. 

(4)  Analyzing  and  comparing  the  equivalent  schemas  to  develop  a 
framework  for  identifying  and  classifying  all  possible  semantic 
conflicts. 

(5)  Exploring  possible  solutions  to  the  conflicts  identified,  and  using 
the  framework  and  resolution  heuristics  to  integrate  a  global 
schema  which  subsumes  all  available  data  from  the  candidate 
databases. 

(6)  Using  the  knowledge  of  semantic  conflicts  gained  to  suggest  design 
strategies  for  a  global  controller  component  to  manage  a  federated 
database  including  the  subject  databases. 

(7)  Reviewing  the  experience  of  the  integration  process  to  suggest 
future  areas  of  research  into  useful  techniques  for  resolving  seman- 
tic heterogeneity. 

F.     ORGANIZATION 

The  thesis  is  organized  in  the  following  manner: 

Chapter  II  addresses  the  proliferation  of  heterogeneous  databases  in 
organizations.  This  includes  an  analysis  of  various  levels  of  heterogeneity, 
and  suggests  sources  of  different  kinds  of  conflicts. 

Chapter  III  reviews  the  required  qualities  of  a  suitable  integrating 
model,  with  particular  mention  of  the  various  types  of  existing  databases 
which  might  have  to  be  modeled.  The  common  conceptual  model  used  in  this 
thesis  is  explained,  and  examples  are  given  of  the  diagrammatic  conventions 
used  in  following  chapters. 

Chapter  IV  presents  a  real  world  scenario  of  heterogeneous  databases 
drawing  on  specifications  obtained  from  various  Department  of  Defense  func- 
tional applications.  The  federated  database  approach  to  integration  is  des- 
cribed, including  the  role  of  the  global  controller  component  which  manages 


the  resolution  of  semantic  conflicts  at  the  functional  level.  Each  database  is 
transformed  into  a  common  equivalent  schema  using  the  integrating  model. 

In  Chapter  V,  the  equivalent  schemas  developed  in  Chapter  VI  are 
systematically  compared  to  form  a  classification  framework  of  semantic  het- 
erogeneity. Examples  of  each  type  of  semantic  conflict  are  illustrated  and  dis- 
cussed based  on  the  specifications  detailed  in  the  appendices. 

Chapter  VI  explores  in  general  terms  possible  means  of  resolving  each 
type  of  semantic  conflict  expressed  in  the  classification  framework.  The  pro- 
posed solutions  are  then  applied  to  the  individual  schemas  to  create  a  com- 
mon global  schema  which  includes  all  information  originally  available. 
Additionally,  this  chapter  applies  the  semantic  conflict  framework  to  theoret- 
ical design  considerations  of  a  federated  database  approach  to  integration. 

The  concluding  chapter  reviews  lessons  learned  in  the  course  of  integrat- 
ing real  world  heterogeneous  databases,  and  offers  conclusions  about  identify- 
ing and  resolving  semantic  conflicts  between  databases.  Recommendations 
and  suggested  areas  of  future  research  are  offered  based  on  the  results  of  this 
analysis. 


II.  PROLIFERATION  OF  HETEROGENEOUS  DATABASES 

IN  ORGANIZATIONS 


A.     HETEROGENEITY  IN  DATABASES 

In  a  perfect  world,  the  advantages  of  interoperability  would  motivate 
end  users,  designers,  and  developers  to  ensure  that  seamless  and  effective 
information  sharing  were  built  into  database  applications  from  the  ground 
up.  Still,  heterogeneous  databases  have  proliferated  throughout  organiza- 
tions for  a  variety  of  reasons.  A  (largely)  homogeneous  paradigm  would  be 
practical  for  an  organization  entering  the  database  field  from  a  standing 
start,  with  access  to  the  full  spectrum  of  currently  available  technology. 
Gradual  evolution,  however,  has  resulted  in  the  current  situation. 
Organizations  such  as  DoD  have  continuously  developed  database  applica- 
tions over  40  years.  Recurring  cycles  of  hardware,  software,  and  technology 
during  that  time  have  all  contributed  to  the  diversity  of  databases  in  use 
today.  In  addition  to  these  essentially  technical  issues,  the  incremental, 
disjoint,  and  arbitrary  implementation  of  conceptual  design  methodologies 
has  contributed  to  the  present  chaotic  assortment  of  incompatible  systems. 

This  evolution  has  resulted  in  a  database  environment  with  three  levels 
of  heterogeneity.  At  the  lowest  level,  different  database  applications  are 
implemented  on  a  wide  range  of  hardware  platforms.  Similar  hardware  can 
run  a  variety  of  operating  systems.  Distributed  databases  must  communicate, 
using  compatible  communications  protocols.  Variation  in  these  protocols 
introduces  more  conflict.  At  the  next  level,  Data  Base  Management  Systems 


(DBMSs)  may  be  incompatible,  even  when  intended  to  work  with  similar  data 
structures.  Finally,  when  data  is  named,  defined  and  organized  into  a 
particular  architecture,  subjective  design  choices  introduce  fundamental  and 
potentially  intractable  semantic  conflicts.  Platform  and  DBMS  heterogeneities 
are  discussed  in  section  one,  while  semantic  heterogeneity  is  discussed  in 
section  two. 

This  thesis  deals  with  semantic  conflicts  which  arise  at  the  schema,  or 
architectural  level  of  database  organization,  and  at  the  data  level.  Such  con- 
flict arises  from  both  technical  and  methodological  causes.  Incidental  hetero- 
geneity issues,  caused  by  hardware,  operating  system,  DBMS  software,  and 
communications  protocol  variations  are  not  addressed. 
1.      Platform  and  DBMS  Heterogeneity 

When  information  was  first  stored  for  electronic  manipulation  by 
computer  systems  independent  from  the  specific  application  programs  doing 
the  manipulation,  it  was  organized  as  'flat  files'.  These  were  simple,  identi- 
cally formatted  records,  accessed  by  the  application  program  through  the 
program's  explicit  knowledge  of  where  in  the  record  a  given  fact  could  be 
found.  No  attempt  was  made  to  make  associations  between  individual 
records,  since  each  was  treated  by  the  application  as  a  unique  piece  of 
information.  Additions  to  the  set  of  records  was  therefore  easy,  but  a  change 
in  the  structure  of  a  record  very  difficult,  since  the  entire  application  had  to 
be  rewritten  to  preserve  the  necessary  explicit  internal  map  of  the  record's 
structure. 

Initial  interest  in  database  research  centered  on  the  management  of 
data  in  business  applications  such  as  automated  payroll,  inventory,  and 
transaction  processing.  These  domains  required  efficiency  in  accessing  and 


modifying  very  large  amounts  of  data,  and  were  oriented  toward  well  defined, 
repetitive  processes  which  could  be  run  from  start  to  completion  in  a  batch 
mode.  Additionally,  these  first  databases  appeared  when  the  physical  limit- 
ations of  the  available  hardware  imposed  very  definite  restrictions  on  the 
architecture  which  could  be  used  to  organize  the  data. 

These  factors  influenced  the  hierarchical,  or  tree-based  approach  to 
data  management.  Data  records  are  assembled  into  a  collection  of  trees,  some 
being  root  records,  and  all  others  having  a  unique  parent  record.  This  organi- 
zation is  amenable  to  the  simple  relationships  of  employee  to  wage,  tax  code, 
dependents,  etc.  in  a  payroll  scenario,  or  the  assembly  to  subassemblies  to 
parts  relation  of  an  inventory.  Since  the  processing  is  repetitive,  and  need  not 
be  done  in  real  time,  hierarchic  database  programs  can  be  optimized  to 
navigate  through  the  tree  structure  even  when  this  is  highly  complex. 
Finally,  the  hierarchic  data  model  was  suited  to  magnetic  tape  storage,  an 
economic  requirement  before  random  access  disk-based  storage  became 
affordable. 

Evolutionary  modification  of  the  hierarchic  data  architecture  led  to 
the  Conference  on  Data  Systems  Languages  (CODASYL)  standard.  This 
arrangement  allows  more  complex,  and  thus  more  useful,  relationships  be- 
tween data  elements  to  be  represented,  with  records  arranged  into  a  directed 
graph  or  network.  Efficient  implementation  of  the  network  organization  both 
required  and  exploited  the  more  flexible  capabilities  of  direct  access  storage 
media.  Disks  rapidly  replaced  tape  as  their  cost-performance  ratio  improved. 
Application  programming  for  a  network  database  requires  a  highly  proce- 
dural navigation  oriented  language,  like  the  hierarchic  model,  which  restricts 
the  degree  of  dynamic  processing  changes  available  to  the  end  user. 
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The  relational  data  model  was  pioneered  in  the  early  nineteen 
seventies  and  offered  a  fundamentally  different  approach  to  data  storage. 
Data  is  represented  as  simple  tabular  structures  (relations),  and  access  is 
allowed  through  a  high  level,  non  procedural  query  language.  The  complexity 
of  relationships  between  data  elements  is  unrestricted.  The  application 
programmer,  or  end  user,  specifies  a  predicate  which  identifies  the  desired 
record  or  combination  of  records,  and  the  DBMS  translates  that  specification 
into  an  efficient  algorithm  which  performs  the  database  access. 

Even  the  most  advanced  relational  models  are  not  without  draw- 
backs, however.  The  computational  complexity  of  solving  queries  involving 
multiple  large  relations  can  be  prohibitive,  and  much  research  has  gone  into 
the  optimization  of  relational  queries.  Efficient  design,  or  normalization,  of 
the  relations  themselves  to  eliminate  redundancy  and  logical  anomalies  has 
also  required  theoretical  advances.  New  approaches  to  allocation  and 
management  of  disk  space  and  memory  buffering  routines  have  been  neces- 
sary to  minimize  storage  cost  and  access  delays.  While  the  relational  data 
model  provides  the  maximum  flexibility  in  organizing  and  manipulating  data 
in  the  early  nineties,  it  does  so  at  some  cost. 

The  evolution  of  theoretical  work  on  data  storage  and  processing, 
and  the  hardware  development  which  facilitated  and  paralleled  it  represent 
the  technical  factors  which  lead  to  heterogeneity  in  databases.  As  applica- 
tions were  developed  and  brought  into  production,  organizational  pressures 
prohibited  continuous  re-engineering  of  applications  to  exploit  each  new  theo- 
retical or  hardware  development,  even  where  that  was  appropriate.  It  must 
be  kept  in  mind  that  some  degree  of  heterogeneity  in  organizational 
databases  is  not  an  aberration  which  can,  or  even  should,  be  completely 


eliminated.  A  relational  database  offers  the  flexibility  to  deal  with  arbitrarily 
complex,  unstructured  queries  on  an  ad  hoc  basis,  but  its  computational 
overhead  does  not  recommend  it  for  a  mature  inventory  system.  When  update 
processing  requirements  are  relatively  static  and  well  understood,  transac- 
tions against  the  inventory  can  be  done  periodically,  and  the  number  of 
records  is  very  large,  a  hierarchic  database  is  a  better  choice.  Over  the  range 
of  organizational  activities  within  the  DoD,  various  problem  domain  solutions 
will  naturally  fall  to  diverse  appropriate  data  model/hardware  combinations. 
2.      Semantic  Heterogeneity 

A  separate  causal  factor  leads  to  semantic  data  conflicts.  Generally, 
these  arise  from  variations  in  database  design  methodology  and  implementa- 
tion. The  technical  factors  discussed  above  concern  physical  application  level 
strategies  and  models.  The  hierarchic,  network  and  relational  data  architec- 
tures deal  with  how  individual  data  elements  are  organized,  physically 
linked,  retrieved,  and  manipulated  by  the  hardware  and  software  of  an  appli- 
cation. As  suggested,  technical  issues  lead  to  a  natural,  unavoidable  diversity 
in  organizational  databases,  based  primarily  on  processing  efficiency  within 
particular  problem  domains.  Methodological  factors,  on  the  other  hand,  result 
from  human  individuality,  differences  in  perception,  and  preferences.  They 
give  rise  to  heterogeneity  between  databases  addressing  the  same  functional 
application,  using  identical  hardware,  operating  systems,  data  models,  and 
DBMS  software.  Because  data  definition,  naming  conventions,  and  conceptual 
organization  are  inherently  subjective  issues,  semantic  conflicts  are  almost 
guaranteed  amongst  databases  developed  by  different  teams  in  the  absence  of 
strictly  enforced  strategic  design  guidelines. 
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Individuals  interpret  the  world  from  their  own  personal  perspec- 
tive. Organizations,  and  subdivisions  of  organizations,  have  similarly  diverse 
views  of  their  environment.  Items  of  interest,  which  become  data  elements, 
aggregations  of  elements,  which  become  records  and  logical  entities  in  organi- 
zational databases,  are  named,  defined,  and  organized  in  this  qualitative, 
subjective,  environment.  If  two  departments  of  the  same  company  undertook 
to  develop  personnel  databases,  without  specific  guidance  from  the  front  of- 
fice, it  would  not  be  surprising  to  find  different  names  for  similar  employee 
attributes,  identical  field  definitions  for  contradictory  elements,  or  even  com- 
pletely different  ways  of  structuring  the  problem.  This  is  a  predictable  and 
unsurprising  consequence  of  individual  and  organizational  differences.  It  is 
germane  that  the  type  of  conflict  described  could,  and  would,  arise  even  if 
central  guidance  was  provided,  but  was  restricted  to  mandating  a  particular 
hardware/DBMS  suite. 

Yet  this  is  exactly  how  many  organizations,  including  the  DoD, 
have  developed  their  database  applications  over  the  last  forty  years.  Until 
very  recently,  only  particular  hardware,  operating  systems,  or  DBMSs  have 
been  standardized  among  the  services  and  their  various  departments.  There 
was  still  no  strategic  guidance  which  provided  common  definitions,  naming 
conventions,  etc.  at  the  element  or  entity  level.  Thus  even  if  DBMS/platform 
conflicts  do  not  arise,  semantic  conflicts  remain  which  can  make  databases 
incompatible. 
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B.     THE  NEED  FOR  INTEGRATION  OF  HETEROGENEOUS 
DATABASES 

As  organizations  mature  in  the  use  of  information  technology,  the 
potential  benefits  of  consolidating  heterogeneous  databases  become  irre- 
sistible. Vital  corporate  information  is  captured,  stored,  and  available  to 
decision  makers  and  operational  functions  from  many  database  applications, 
but  incompatibilities  can  prevent  the  integration  of  data  from  different 
sources.  Elimination  of  data  redundancy,  to  achieve  cost  advantages,  means 
more  applications  must  share  compatible  data.  Data  accuracy,  critical  for 
high-risk  decisions,  can  be  enhanced  by  identifying  disjoint  data  among 
similar  databases  and  resolving  the  semantic  conflicts. 

The  need  for  standardization  of  data  management  has  been  recognized 
by  the  DoD  and  forms  a  central  part  of  the  Corporate  Information  Manage- 
ment (CIM)  initiatives.  Current  data  dictionary  efforts,  which  address  the 
problem  of  semantic  data  discrepancies  at  their  lowest  level,  hold  promise  for 
ameliorating  the  problem  in  future  applications.  There  is  also  an  urgent  need 
for  high  level  methods  to  allow  the  integration  of  information  in  currently 
existing  heterogeneous  databases. 

Two  approaches  have  been  identified  which  address  this  issue.  The 
multidatabase  approach  leaves  the  component  databases  in  their  native  form, 
but  provides  transparent  access  to  all  included  information.  Users  are  aware 
that  they  are  dealing  with  multiple  diverse  databases,  both  schematically, 
and  physically.  Alternatively,  the  federated  approach  consolidates  the 
component  databases  under  a  global  schema,  and  gives  both  location  and 
heterogeneity  transparency.  Users  interact  with  the  data  as  though  it  were  in 
a  single,  physically  contiguous,  logically  consistent  database.  Either  approach 
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requires  a  strong  logical  data  model  to  describe  multiple  individual  physical 
data  architectures.  The  next  chapter  addresses  a  suitable  integrating  model. 

Once  all  databases  of  concern  have  been  expressed  in  a  common  concep- 
tual schema,  semantic  conflicts  among  individual  data  elements  can  be  easily 
identified.  Chapter  IV  presents  three  heterogeneous  real-world  databases  and 
describes  the  process  of  transforming  them  into  equivalent  schemas  in  the 
common  integrating  model.  Chapter  V  develops  a  framework  of  semantic  het- 
erogeneity for  the  integrating  model.  The  framework  enables  the  classifica- 
tion of  semantic  data  conflicts  stemming  from  human  variation  in  method- 
ological implementation.  With  a  comprehensive  integrating  model,  and  a  tax- 
onomy for  identifying  semantic  heterogeneity  which  includes,  schematic  and 
data  conflicts,  possible  solutions  can  be  proposed.  This  is  the  subject  of 
Chapter  VI.  Conflicts  in  architecture  and  data  organization  which  arise  at  the 
Platform/DBMS  level  are  properly  addressed  by  the  detailed  implementation 
of  the  integration  effort.  The  resulting  consolidated,  reconciled  information 
can  be  accessed  through  appropriate  systems  to  provide  organization-wide 
use  of  existing  heterogeneous  databases. 
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III.  THE  ENHANCED  ENTITY  RELATIONSHIP  MODEL 

A-     DATA  MODELING 

When  a  database  application  is  developed,  the  segment  of  the  real  world 
to  be  modeled  is  analyzed  in  light  of  the  users'  requirements.  The  designers 
make  a  choice  about  the  conceptual  data  model  to  be  used.  The  choice  of 
model  is  governed  by  its  perceived  appropriateness  to  the  problem  domain, 
the  personal  preference  of  the  designers  and  their  familiarity  with  various 
methodologies.  Conceptual  modeling  is  done  at  a  very  general  level  of 
analysis,  and  has  only  marginal  impact  on  implementation  decisions.  The 
data  elements  and  arrangements  suggested  by  the  analysis  must  then  be 
formally  specified,  and  their  structure  and  behavior  defined  in  terms  of  the 
logical  model.  The  ultimate  physical  organization  of  data  (in  a  network,  rela- 
tions, etc.)  is  independent  of  the  logical  schema  used  for  design,  and  is  chosen 
as  a  function  of  processing,  access  requirements,  transaction  frequency,  and 
the  structure  of  the  resulting  schema. 

In  considering  heterogeneous  databases  with  a  view  toward  information 
sharing  and  consolidating  access,  the  original  logical  design  is  often  unavail- 
able, and  the  conceptual  model  used  unknown.  The  final  application  architec- 
ture may  provide  no  indication  of  the  conceptual  scheme  used  in  the  initial 
analysis.  A  logical  integrating  model  which  can  describe  multiple  diverse 
implementation  models  is  needed  to  subsume  the  heterogeneous  component 
databases  and  allow  them  to  be  expressed  in  a  consistent  schema. 
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Of  the  potential  candidates  for  an  integrating  conceptual  model,  the 
Entity-Relationship  (ER)  approach  stands  out  as  a  strong  candidate.  It  is 
semantically  rich,  conceptually  simple,  and  can  capture  arbitrarily  compli- 
cated relationships  between  atomic  elements  and  larger  groupings  of  infor- 
mation. It  is  widely  used  in  database  design  [Ref.  1],  and  offers  a  natural  and 
intuitively  understandable  way  of  displaying  information  and  real  world  rela- 
tionships. With  the  additional  semantic  expressiveness  provided  by  exten- 
sions to  the  ER  model  (referred  to  as  Enhanced  Entity  Relation,  or  EER), 
newly  popular  concepts  such  as  inheritance  can  also  be  defined. 

Although  sophisticated  renditions  of  EER  schema  become  diagrammati- 
cally  complex,  the  essential  representation  of  atomic  data  elements  as  con- 
nected attributes  which  describe  an  entity,  or  real  world  item  of  interest,  is 
fundamental.  Relationships  between  entities,  and  the  characteristics  of  the 
relations  (cardinality,  mandatory  participation,  etc.),  are  explicitly  defined 
and  represented  by  the  model,  making  it  simple  to  visually  interpret  an  ER 
schema.  The  ER/EER  data  model  is  one  of  the  most  widely  used  logical 
schemes  for  conceptual  database  design  [Ref.  2].  This  wide  acceptance,  as 
well  as  its  superior  descriptive  qualities,  make  it  the  most  appropriate 
integrating  model. 

1.      Top  Down  Modeling 

In  top  down  database  modeling,  the  user's  real  world,  or  the  portion 
of  interest,  is  analyzed  in  terms  of  data  requirements  and  relationships. 
Appropriate  data  types  are  defined,  and  the  information  is  arranged  in  logical 
groupings  which  meet  the  users'  needs.  At  this  level  of  modeling,  no  imple- 
mentation details  are  considered,  and  the  resulting  schema  is  easy  to 
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understand  and  verify  with  non-technical  users.  The  basic  tool  for  this 
process  is  the  conceptual  data  model. 

For  example,  consider  the  design  of  a  database  to  organize  informa- 
tion about  officer  personnel  for  the  Department  of  the  Navy.  The  user  has 
specified  that  the  information  of  interest  includes  basic  data,  such  as  name, 
rank  and  serial  number,  as  well  as  the  officer's  duty  assignment.  The 
designer,  using  the  EER  technique,  takes  these  requirements,  and  arranges  a 
conceptual  schema  which  represents  the  officer  as  an  entity,  defined  by  the 
attributes  of  name,  rank,  and  social  security  number.  Likewise,  the  unit  he  or 
she  is  assigned  to  is  shown  as  an  entity,  defined  by  a  unit  identification  code 
attribute.  The  relationship  between  the  officer  and  the  unit  is  also 
represented. 

The  user  also  provides  specifications  about  appropriate  data  types 
for  various  elements.  Name  might  be  most  usefully  defined  as  a  character 
string,  while  rank  is  desired  to  be  represented  by  some  arbitrary  code  which 
fits  into  the  user's  overall  information  processing  philosophy.  At  this  point, 
uniquely  defining,  or  key,  attributes  are  defined  for  entities  where  possible 
confusion  could  exist  between  two  sets  of  information.  This  could  occur  if  two 
officers  had  identical  names  and  ranks.  When  the  conceptual  schema  is 
complete,  the  user  confirms  that  the  information  and  arrangement  meets  the 
database  requirements,  and  implementation  proceeds  through  the  mapping 
of  the  conceptual  schema  to  a  DBMS,  and  design  of  physical  data  storage 
structures. 

2.      Bottom  Up  Modeling 

The  use  of  conceptual  design  techniques  such  as  the  EER  model  in  a 
bottom  up  manner  differs  in  that  the  purpose  is  not  to  capture  a  suitable 
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schema  from  real  world  information.  Instead,  the  intent  is  to  reverse  engineer 
a  conceptual  schema  from  an  existing  database  implementation.  Data  types, 
file  structures  and  attribute  definitions  have  already  been  designed  and 
implemented.  Transforming  the  low  level  database  implementation 
specifications  back  into  a  high  level  conceptual  schema  allows  analysis  of  the 
choices  made  in  arranging  the  original  data  requirements. 

More  important,  bottom  up  data  modeling  can  render  completely 
different  database  implementations  in  equivalent  form  for  comparison  and 
interpretation.  This  is  the  main  thrust  of  reverse  conceptual  modeling  in  this 
thesis.  The  EER  model  is  semantically  rich  enough  to  conceptually  represent 
many  existing  database  implementation.  The  EER  model  will  be  used  in 
Chapter  IV  as  a  common  model  to  transform  diverse  heterogeneous  databases 
into  equivalent  schema's,  for  analysis  of  potential  semantic  conflicts.  The 
following  sections  present  the  EER  concepts  and  the  diagrammatic 
conventions  used  in  this  thesis.  The  specific  EER  model  used  throughout  this 
thesis  is  taken  from  [Ref.  3].  The  closing  section  of  this  chapter  briefly 
describes  the  application  of  the  EER  modeling  concepts  to  bottom  up  analysis 
of  various  different  database  implementation  types. 

B.     EER  CONCEPTS 

The  Enhanced  Entity  Relationship  model  is  essentially  very  simple. 
Information  is  represented  by  entities,  which  are  described  by  attributes,  and 
associated  to  each  other  by  various  kinds  of  relationships.  With  intuitive 
extensions  of  these  three  simple  concepts,  arbitrarily  complex  views  of  the 
real  world  can  be  expressed  in  a  graphic  and  easily  understood  way.  Since  the 
use  of  the  model  here  is  not  to  capture  a  top  down  schema  from  beginning 
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user  requirements,  not  all  of  the  semantic  expressiveness  available  will  be 
described.  For  a  complete  examination,  the  reader  is  directed  to  [Ref.  3]. 
1.      Entities 

The  central  object  in  the  EER  model  is  the  entity,  which  represents 
a  real  world  'thing*  with  independent  existence.  It  may  be  something  with 
physical  existence,  such  as  an  officer,  or  a  concept,  like  a  security  clearance. 
Entities  are  described  by  properties,  which  are  real  world  facts  about  an 
entity.  They  are  also  associated  with  other  entities  to  capture  additional 
information. 

Entities  can  be  unique,  and  independently  defined,  or  they  can  be 
dependent  on  the  existence  of  another  entity.  Such  entities  are  referred  to  as 
'weak'.  A  security  clearance  entity  is  an  example  of  a  weak  entity,  since  in  the 
real  world,  it  doesn't  make  sense  to  think  of  that  entity  without  a  related 
officer,  who  holds  the  clearance.  Weak  entities  have  their  own  attributes,  and 
represent  important  real  world  concepts,  but  must  be  associated  with  an 
identifying  owner  to  have  meaningful  semantic  content. 

Figure  1  illustrates  an  entity  type. 


MEMBER 


Figure  1.  Entity  Type 

Figure  2  is  a  weak  entity  type,  having  no  useful  semantic  content 
without  an  identifying  relationship. 
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Figure  2.  Weak  Entity  Type 
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2.      Attributes 

Descriptive  facts  about  entities  are  called  attributes.  They  can  be 
simple,  single  valued  attributes,  such  as  a  social  security  number,  or  they 
might  be  multivalued,  or  even  made  up  of  other  attributes.  Composite  attri- 
butes make  it  possible  to  represent  data  which  may  be  handled  as  a  whole 
sometimes,  but  in  part  at  others.  An  officer's  name  might  be  a  composite 
attribute,  if  it  is  used  in  full  (Last,  First,  Middle)  in  some  instances,  and 
sometimes  in  part  (Last  only). 

A  critical  attribute  concept  is  that  of  the  key.  A  key  attribute  is  one 
which  uniquely  defines  the  entity  it  describes.  This  allows  distinguishing 
between  instances  of  an  entity  type  for  which  all  other  attributes  are  identi- 
cal. Social  security  number  is  a  very  common  key  attribute.  A  related  concept 
is  that  of  the  partial  key.  A  partial  key  attribute  uniquely  describes  a  weak 
entity  when  concatenated  with  the  key  of  the  weak  entity's  identifying  owner. 

A  final  very  useful  attribute  type,  is  the  derived  attribute.  This  rep- 
resents information  which  is  not  explicitly  captured  in  the  database,  but  may 
be  determined,  or  calculated,  from  related  information.  The  total  number  of 
officers  assigned  to  a  unit,  for  instance,  could  be  calculated  from  the  number 
of  related  officer  entities  for  each  instance  of  the  unit  entity.  Total  number 
assigned  could  then  be  assigned  to  the  unit  entity  as  a  derived  attribute. 

Figure  3  shows  how  a  simple  attribute  is  depicted  graphically. 


Figure  3.  Attribute 
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Figure  4  depicts  a  key  attribute,  and  a  partial  key  attribute. 

Key  Attribute  Partial  Key  Attribute 

Figure  4.  Key  Attribute  and  Partial  Key  Attribute 


Figure  5  illustrates  a  multivalued  attribute  (an  attribute  with  a 
single  meaning,  for  which  an  entity  might  have  multiple  instances). 
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Figure  5.  Multivalued  Attribute 


Figure  6  represents  a  composite  attribute. 


Figure  6.  Composite  Attribute 
Figure  7  shows  how  a  derived  attribute  is  diagrammed. 

Figure  7.  Derived  Attribute 
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Figure  8  shows  a  partial  completed  entity  with  its  descriptive 
attributes. 
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Figure  8.  Member  Entity 
Figure  9  represents  a  weak  entity  with  its  partial  key. 
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Figure  9.  SCRTY-CLEAR  Weak  Entity 

3.      Relationships 

The  third  basic  concept  in  EER  modeling  is  the  relationship.  This  is 
used  to  represent  associations  of  varying  types  between  entities.  An  officer, 
for  example,  could  be  related  to  a  unit  by  the  relationship  'Assigned  To'.  The 
completed  schema  then  makes  it  explicit  that  an  officer  is  assigned  to  a  unit, 
by  connecting  the  two  entities  with  a  relationship.  Weak  entities  are  associ- 
ated with  their  identifying  owners  by  an  identifying  relationship. 

Relationships  can  capture  a  very  large  range  of  semantic  meaning 
by  the  addition  of  relationship  cardinality.  Cardinality  refers  to  constraints 
on  the  relationship.  In  other  words,  if  every  officer  is  assigned  to  one  and  only 
one  unit,  this  is  denned  in  the  EER  schema  by  adding  a  cardinality  number  to 
the  relationship  in  the  direction  from  the  officer  to  the  unit.  Units,  logically, 
would  have  many  officers  assigned,  and  this  would  be  represented  by  an 
appropriate  cardinality  in  the  relationship  direction  from  the  unit  to  the 
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officer.  The  graphical  conventions  used  to  depict  cardinality  will  be  shown  in 
the  following  section,  and  their  usage  will  be  more  obvious. 

Additional  constraints  on  relationships  are  referred  to  as  partial 
and  total  participation.  This  can  be  visualized  by  considering  the  weak  entity 
example  above.  Since  a  security  clearance  has  no  semantic  meaning  without 
an  identifying  relation  to  an  owner  officer  entity,  security  clearance 
participation  in  that  relation  must  be  total.  In  other  words,  each  and  every 
instance  of  security  clearance  must  participate  in  the  identifying  relationship 
with  some  officer  entity,  or  it  cannot  exist  in  the  schema.  Conversely,  it  is 
possible  to  conceptualize  a  unit,  perhaps  newly  formed,  which  has  no  officers 
assigned.  This  allowable  meaning  is  represented  by  a  partial  participation 
relationship.  A  unit  entity  is  allowed  to  exist  without  necessarily  partici- 
pating in  a  relationship  with  a  particular  officer.  By  combining  participation 
and  cardinality  constraints  on  relationships,  any  conceivable  association  of 
entities  can  be  modeled  using  EER  techniques. 

Figure  10  is  an  example  of  a  simple  relationship. 
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Figure  10.  Relationship  of  AATY  to  MEMBER 

Figure  11  illustrates  the  identifying  relationship  between  a  weak 
entity  and  it's  identifying  owner.  The  double  diamond  around  the  relationship 
specifies  that  it  is  an  identifying  relationship.  The  double  line  connecting  the 
weak  entity  to  the  relationship  is  used  to  indicate  the  total  participation  of 
the  weak  entity.  This  is  a  condition  of  the  identifying  relationship,  but  not 
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restricted  to  this  relationship  type.  Any  relationship  type  can  be  constrained 
on  either  side  by  total  participation. 
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dg^      ^AC&      <|th^  _      ^r-^jgiro 

bJ —     member     r     :^3^> 


Figure  11.  Identifying  Relationship  of  SCRTY-CLEAR  to  MEMBER 

Figure  12  gives  examples  of  various  cardinality  constraints  on 
relationships.  The  cardinalities  are  read  in  the  direction  away  from  the  con- 
strained entity.  In  other  words,  for  the  cardinality  label  immediately  above 
the  connecting  line,  each  entity  is  related  to  one  and  only  one  instance  of  the 
other  entity  (one-to-one).  Next  above  shows  the  ENTITY  1  related  to  many 
ENTITY  2  (one-to-many)  Finally,  above  the  line,  is  an  example  of  many  enti- 
ties on  either  side  related  to  many  entities  (many- to-many). 


ENTITY  1 


ENTITY  2 


Figure  12.  Various  Cardinalities  of  Relationships 

Below  the  line  are  illustrated  more  complex  cardinality  constraints. 
These  are  read  identically,  in  terms  of  direction,  the  conceptual  extension 
being  the  range  defined  in  the  parentheses.  The  left  number  of  the  ordered 
pair  represents  the  lower  bound  on  participation,  and  the  right  number  the 
upper  bound.  Thus,  reading  the  example  immediately  below  the  line  depicts  a 
relationship  in  which  the  ENTITY  1  must  be  related  to  at  least  one,  and  may 
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be  related  to  any  number  of  ENTITY  2  Conversely,  ENTITY  2  is  related  to  one 
and  only  one  ENTITY  1.  The  remaining  cardinality  constraints  are  read  in  a 
similar  manner,  and  are  not  exhaustive. 
4.      Complex  Data  Organizations 

Complex  and  useful  data  organizations  such  as  those  becoming 
popular  in  Object  Oriented  analysis  are  represented  in  the  EER  model  by 
specialized  extensions  to  relationships.  Two  will  be  described  below.  The 
Generalization/Specialization  structure,  which  captures  the  concept  of  inheri- 
tance, and  the  aggregation  structure,  which  captures  the  whole-part  relation- 
ship concept. 

a.      Generalization/Specialization 

The  Generalization/Specialization  (Gen/Spec)  relationship  is 
used  to  model  a  schema  of  entities,  which  all  posses  common  attributes  as 
part  of  their  description,  but  which  for  some  subset  of  entity  instances, 
unique  attributes  define  logical  subclasses.  It  is  sometimes  referred  to  as  an 
IS-A(N)  relationship  (i.e.,  the  Specialization  entity  IS-A  Generalization  entity). 
A  simple  illustration  which  expands  on  those  used  above  is  to  consider  a 
personnel  database  containing  data  not  only  on  officers,  but  all  members  of  a 
given  service.  For  all  entities  representing  service  members,  a  large  number 
of  attributes,  such  as  name,  social  security  number,  etc.,  will  be  same.  That 
is,  all  members  will  possess  these  attributes.  Officer  members,  however,  will 
have  different  attributes  than  enlisted  members,  and  it  is  conceptually  ele- 
gant to  be  able  to  model  this  phenomenon  explicitly. 

This  is  done  with  the  Gen/Spec  relationship  which  connects  the 
generalized  entity  member,  to  the  specialized  entities  enlisted,  and  officer. 
Thus  for  a  given  instance  of  officer,  the  full  set  of  defining  attributes  consists 
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of  those  belonging  to  the  generalized  member,  in  addition  to  the  specific  sub- 
set of  attributes  which  define  the  specialized  officer.  The  officer  instance 
'inherits'  the  attributes  of  its  related  member  instance. 

Gen/Spec  relationships  can  be  extended  with  various  qualifica- 
tions, just  as  simpler  relationships.  Two  Gen/Spec  constraints  utilized  in  this 
thesis  are  those  of  total  participation  of  the  general  entity,  and  disjointness. 
Total  participation  represents  the  semantic  concept  that  each  and  every 
member  of  instance  of  the  general  entity  must  belong  to  one  or  more  of  the 
related  specialization  entity  types.  If  on  the  other  hand,  it  were  allowable  for 
a  general  entity  to  exist  independently  (that  is,  only  possess  the  generalized 
attributes),  the  general  entity's  participation  in  the  Gen/Spec  relationship 
would  be  partial. 

Disjointness  indicates  that  each  specialization  entity  must 
belong  to  only  one  specialization.  In  the  member  to  enlisted/officer  relation- 
ship, disjointness  is  enforced,  since  each  member  must  be  either  an  officer,  or 
an  enlisted.  Alternatively,  a  Gen/Spec  relationship  in  which  a  specialization 
entity  could  belong  to  more  than  one  specialization  would  be  an  overlapping 
type. 

Figure  13  diagrams  a  disjoint  Gen/Spec  relationship  with  the 
constraint  that  each  and  every  MEMBER  must  belong  to  either  the  MEMBER- 
OFR,  or  the  MEMBER-ENL  specialization.  Disjointness  is  represented  by  the 
small  'd'  in  the  relationship  circle,  and  total  participation  of  MEMBER  by  the 
double  line  connecting  MEMBER  to  the  relationship. 
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Figure  13.    MEMBER  to  MEMBER-OFR  Generalization/Specialization 
Relationship 


b.     Aggregation 

Aggregation  is  an  abstraction  concept  for  building  composite 
entities  from  component  entities.  This  can  be  thought  of  as  a  PART-OF  rela- 
tionship. For  instance,  Army,  is  PART-OF  Department  of  Defense.  This  is 
extremely  useful  in  EER  modeling  of  some  advanced  database  applications. 

C.     APPLICATION  TO  REAL  WORLD  DATABASES 

This  section  reviews,  in  general  terms,  some  of  the  conceptual  bottom  up 
modeling  techniques  applicable  to  transforming  existing  database  implemen- 
tations into  equivalent  EER  schemas.  In  some  ways,  reverse  conceptualization 
of  existing  data  organizations  is  simpler  than  top  down  design,  since  many  of 
the  structural  choices  have  already  been  made,  and  may  be  obvious.  On  the 
other  hand,  absent  the  original  design,  some  arbitrariness  is  inevitable,  and 
it  falls  to  the  re-engineering  analyst  to  make  logical  choices.  While 
randomness  of  data  structuring  is  a  danger  of  this  approach,  if  consistent 
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criteria  are  used,  the  purpose  of  the  modeling,  which  is  to  allow  methodical 
comparison  of  semantic  schemas,  will  be  fulfilled. 

1.  Flat  Files 

Conceptually,  a  flat  file  data  organization  can  be  simply  rendered 
as  a  single  entity  possessing  all  the  attributes  defined  in  its  record  structure. 
However,  this  approach  is  inelegant,  and  loses  much  of  the  semantic  content 
which  is  likely  represented  by  the  original  file  record.  Thus,  repeating  groups 
of  fields  are  sought,  and  extracted  as  separate  entities  with  appropriate  rela- 
tionships to  the  entity  suggested  by  the  major  category  of  the  record. 
Similarly,  if  a  record  has  fields  which  are  utilized  for  different  meanings 
depending  on  the  values  of  other  fields,  this  suggests  that  the  record  actually 
describes  a  Gen/Spec  organization,  and  is  so  mapped  to  the  EER  diagram. 
Accurately  recreating  the  cardinalities  of  relationships  is  the  most  difficult 
part  of  bottom  up  modeling,  since  these  constraints,  while  explicitly  repre- 
sented by  the  EER  schema,  are  in  general  enforced  at  the  implementation 
level,  and  often  are  not  included  as  part  of  the  available  database  definition. 
Additionally,  whether  or  not  a  particular  entity  is  'weak',  and  the  choice  of 
identifying  owner  for  those  which  are,  may  not  be  obvious.  In  these  cases,  the 
re-engineering  analyst  must  make  logical  and  consistent  choices  based  upon 
knowledge  of  the  information  domain. 

2.  Hierarchic 

Basic,  restricted  data  organizations  such  as  the  hierarchic,  or  tree 
based  structure,  can  be  fully  described  quite  simply.  The  relationship  be- 
tween entities  is  one-to-one/many,  and  the  presence  of  specific  attributes  may 
allow  the  collection  of  several  elements  into  attributes  describing  a 
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generalized  entity.  Layered  grouping  of  entities  as  'children'  of  other  entities 
requires  no  extension  of  the  concept. 

3.  Network 

The  network  model  builds  on  the  tree  architecture,  but  allows  addi- 
tional associations  between  entities.  This  arrangement  is  restricted  by  the 
condition  that  a  'child'  may  have  only  one  'parent'  of  a  given  type.  The  result- 
ing multiple  one-to-many  relationships  form  the  network,  for  which  the  model 
is  named.  Like  mapping  the  hierarchic  structure  to  an  EER  schema,  no  par- 
ticularly complicated  analysis  is  necessary,  other  than  choosing  appropriate 
cardinality  of  relationships,  if  this  is  not  expressly  defined  for  the  re-engi- 
neering analyst. 

4.  Relational 

Relational  data  structures  are  not  constrained  in  the  complexity  of 
connections  between  data  elements  and  entities.  But  in  modeling  the  rela- 
tional implementation  to  an  EER  schema,  most  often  one  may  proceed  from 
the  assumption  of  a  correspondence  between  tables  and  entities  Relationships 
are  explicitly  given  by  the  distribution  of  foreign  keys.  Cardinalities  may 
have  to  be  inferred,  as  in  the  previous  cases. 

5.  Object  Oriented 

Although  not  widely  available,  there  is  great  interest  in  commercial 
database  implementations  which  exploit  the  use  of  object  oriented  design. 
The  generalization,  aggregation,  and  inheritance  constructs  offered  by  the 
EER  model  are  powerful  and  desirable  data  ordering  concepts.  In  the  future, 
integration  of  heterogeneous  databases  including  Object  Oriented  implemen- 
tations will  use  these  descriptive  properties  of  the  EER  model  more  fully  than 
they  are  employed  here. 
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IV.  HETEROGENEOUS  DATABASE  SCENARIO 

A.     THREE  OFFICER  PERSONNEL  ADMINISTRATIVE  DATABASES 

Personnel  information  for  the  Department  of  Defense  is  currently  stored 
in  a  variety  of  separate  and  diverse  databases.  A  great  wealth  of  data  is  avail- 
able, but  is  maintained  by  different  organizations,  using  different  database 
management  systems  (DBMS),  design  philosophies,  and  hardware  platforms. 
Frequently,  data  that  span  across  several  databases  need  to  be  retrieved. 
Under  the  current  environment,  however,  integrating  the  total  information 
presents  many  difficulties. 

To  accomplish  the  integration,  the  many  conflicts  arising  between  the 
multiple  databases  must  be  resolved,  to  allow  global  querying  of  the  body  of 
data.  Platform  incompatibility,  such  as  that  between  diverse  Operating  Sys- 
tems, manufacturers'  physical  hardware  implementations,  etc.,  is  solvable, 
although  sometimes  at  great  cost  in  processing  resources.  At  the  level  of 
DBMS  heterogeneity,  programming  techniques  can  be  used  to  translate  a 
query  appropriate  to  a  relational  database  into  one  suitable  for  searching  a 
flat  file  structure.  At  the  semantic  level,  conflicts  of  meaning,  and  conceptual 
arrangement  of  data  must  be  reconciled. 

Of  the  three  levels  of  heterogeneity,  semantic  conflicts  are  the  most  diffi- 
cult to  resolve.  During  the  conceptual  design  of  a  database  application,  the 
meaning  and  structural  organization  imposed  on  real  world  information  of 
interest  fundamentally  influences  every  subsequent  use  of  that  stored  data. 
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Even  when  identical  DBMSs,  platforms,  operating  systems,  etc.,  are  consid- 
ered, many  conflicts  can  still  arise  due  to  the  different  meanings  assigned  to 
the  same  real  world  item  by  different  designers. 

This  chapter  examines  three  actual  administrative  databases  currently 
in  use  by  various  organizations  within  the  Department  of  Defense  to  main- 
tain information  on  commissioned  officer  personnel.  DBMS  and  platform  dif- 
ferences amongst  these  databases  will  be  ignored  except  where  these  issues 
influence  the  effort  to  identify,  and  classify,  semantic  conflicts. 
1.      Active  Duty  Military  Inventory  (ADMI) 

The  ADMI  database  is  maintained  by  the  Defense  Manpower  Data 
Center,  and  includes  data  on  all  active  duty  military  personnel,  both  officer 
and  enlisted.  It  is  a  tape-based  flat  file  database,  and  serves  primarily  to  pro- 
cess batch  transactions  for  various  reports  of  interest  to  the  manpower  office 
of  the  Secretary  of  Defense.  Information  on  Naval  commissioned  officers  is 
therefore  available  as  a  subset  of  the  records  of  the  ADMI  database.  A  partial 
database  specification  for  ADMI  is  presented  in  Appendix  A.  While  not  com- 
plete, in  terms  of  complete  data  definition,  the  level  of  detail  available  is 
representative  of  what  might  actually  be  available  during  the  course  of  inte- 
grating a  multidatabase  application.  Reasonable  assumptions  have  been 
made  as  to  exact  attribute  definitions,  in  some  cases  to  illustrate  a  particular 
point  of  potential  semantic  conflict. 

The  ADMI  database  stores  basic  information  of  interest  to  the  per- 
sonnel administration  function,  such  as  name,  rank,  social  security  number, 
and  date  of  birth,  sex,  race,  etc.  It  also  keeps  data  on  marital  status,  number 
of  dependents,  and  whether  a  member's  spouse  is  also  a  member  of  the  mili- 
tary. In  addition  to  these  facts,  the  ADMI  database  contains  an  extensive 
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number  of  statistical  elements  concerning  a  member's  status  on  original 
entry  to  military  service.  This  includes  height,  weight,  test  form  number, 
both  raw  and  adjusted  scores  for  the  Armed  Services  Vocational  Aptitude 
Battery  (ASVAB),  and  place  of  entry  into  the  service. 

2.      Officer  Personnel  Information  System  (OPINS) 

The  OPINS  database  is  maintained  by  the  Bureau  of  Naval  Person- 
nel to  track  commissioned  officer  assignment,  promotion,  and  qualification 
status.  Like  the  ADMI,  it  is  a  flat  file  database,  and  it  theoretically  contains 
the  entire  population  of  interest  for  this  scenario.  A  partial  specification  is 
presented  in  Appendix  B.  Similar  assumptions  as  to  attribute  definitions 
have  been  made,  but  in  both  cases,  attribute  names  have  been  taken  directly 
from  the  specification  as  listed  in  the  appendix. 

OPINS  stores  similar  common  personnel  information  to  that  in  the 
ADMI  database,  such  as  name,  rank,  sex,  etc..  The  data  reflects  important  dif- 
ferences in  the  OPINS  area  of  interest,  however.  It  contains  relatively  detailed 
data  about  an  officer's  educational  history,  both  civilian  and  military,  as  well 
as  the  military  qualifications  resulting  from  that  training.  The  officer's  pro- 
motion status  and  history  is  captured  very  explicitly,  including  year  group, 
precedence  number,  and  the  dates  of  accession  to  each  rank.  The  unit 
assignment  data  in  OPINS  differs  from  the  brief  essentials  kept  by  the  ADMI 
database  in  being  far  more  extensive.  Historical  assignments,  by  billet  num- 
ber, primary  and  collateral  duty,  dates  assigned,  and  projected  rotation  date 
for  the  current  assignment  are  maintained  for  each  officer. 
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3.      Inactive  Manpower  and  Personnel  Management 
Information  System  (IMAPMIS) 

The  IMAPMIS  database  is  maintained  by  the  Naval  Reserve  Force  as 
an  integrated  repository  of  information  on  all  members  of  the  Naval  Reserve. 
This  includes  both  officer  and  enlisted  reserve  personnel,  as  well  as  active 
duty  Naval  personnel  in  the  Training  and  Administration  of  Reserves  (TAR) 
field.  It  is  a  relational  database  and  is  the  most  recently  implemented  of  the 
three.  Partial  table  definition  for  IMAPMIS  is  presented  in  Appendix  C.  Fewer 
assumptions  at  the  attribute  level  were  required  in  analyzing  IMAPMIS,  as  the 
available  definition  is  far  more  complete  than  for  ADMI  or  OPINS. 

Like  the  ADMI  database,  and  the  OPINS,  the  IMAPMIS  maintains  the 
essential  administrative  data  needed  by  the  personnel  function  (name,  rank, 
pay  entry  base  date,  etc.).  It  also  stores  a  wide  variety  of  unique  information 
specific  to  the  Naval  Reserve  manpower  management  process.  This  includes 
reserve  unit  affiliation,  in  addition  to  mobilization  unit  assignment,  last  paid 
drill,  total  credited  drills,  whether  drills  were  voluntary  or  mandatory,  and 
retirement  points  accumulated.  The  training  data  captured  by  the  IMAPMIS  is 
also  the  most  extensive  of  the  three  systems,  including  the  information  avail- 
able in  both  the  OPINS  and  the  ADMI  databases,  as  well  as  data  elements  indi- 
cating reserve  officer  training  accomplished  by  enlisted  members,  reserve 
mobilization  training  evolutions,  and  service  experience  in  military  opera- 
tions. As  was  seen  in  the  variation  of  informational  content  between  the  ADMI 
and  OPINS  databases,  the  specific  facts  recorded  in  the  IMAPMIS  reflect  the 
different  area  of  interest  of  its  users. 

In  all  cases,  the  assigned  definitions  are  intended  to  be  realistic, 
and  consistent  with  the  design  of  the  database  in  question.  The  assumed 
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definitions  should  not  be  taken  as  representative  of  any  actual  data  definition 
in  use  for  the  given  database,  and  are  only  presented  for  the  purpose  of 
illustration. 

B.     SOURCES  OF  HETEROGENEITY 

1.  Database  Management  System  /  Platform 

It  is  obvious  that  the  three  (ADMI,  OPINS,  and  IMAPMIS)  have  differ- 
ent implementation  details.  While  ADMI  and  OPINS  may  in  fact  run  under 
identical  DBMSs,  hardware,  and  operating  system,  IMAPMIS  certainly  runs 
under  an  incompatible  DBMS,  and  has  a  different  hardware/operating  system 
combination.  Any  heterogeneity  this  situation  may  or  may  not  introduce  to 
the  multidatabase  scenario  under  discussion  is  not  germane  to  this  analysis. 
The  focus  of  this  analysis  is  the  effort  to  identify  and  resolve  the  semantic 
conflicts  which  are  present. 

2.  Semantic 

Since  the  three  databases  under  discussion  were  all  developed  and 
implemented  at  different  times,  by  different  organizations,  for  different  pur- 
poses, it  should  not  be  surprising  that  very  different  conceptual 
arrangements  have  resulted.  A  review  of  the  Defense  Manpower  Data  Center 
ADMI  database  reveals  a  very  different  area  of  interest,  for  instance,  than 
that  of  OPINS.  The  Defense  Manpower  Data  Center  is  concerned  with  issues 
such  as  total  military  end-strength,  allotment  of  personnel  resources  to 
budgetary  program  elements,  and  the  like.  OPINS,  on  the  other  hand,  being  a 
service-specific  database,  captures  a  very  different  set  of  data  for  a  given 
officer,  including  present  and  past  assignments  by  billet,  and  promotion  year 
group.  There  is  a  large  overlap  in  the  area  of  basic  information  (name,  rank, 
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SSN,  etc.),  but  it  is  obvious  that  the  designers  of  OPINS  were  interested  in  a 
different  view  of  the  commissioned  officer  than  that  presented  by  ADMI. 
IMAPMIS  data  overlaps  both  ADMI  and  OPINS,  and  additionally  captures 
information  of  specific  interest  to  the  personnel  management  of  the  Reserve 
force,  such  as  Reserve  unit  affiliation,  and  last  credited  drill  period. 

Besides  varying  areas  of  informational  interest,  the  three  database 
design  efforts  employ  very  different  naming  conventions.  ADMI  largely 
employs  plain  language  labels  for  data  elements  which  are  easily  understood. 
OPINS  uses  much  more  service-specific  language,  which  would  be  obvious  to 
someone  familiar  with  Navy  terminology,  but  perhaps  confusing  to  a  layman. 
IMAPMIS  follows  the  OPINS  terminology  closely,  but  since  it  is  described  in  a 
particular  DBMS  language,  the  entity  and  attribute  names  are  awkward  and 
not  always  easily  matched  to  their  corresponding  elements  in  ADMI  and 
OPINS.  This  results  in  a  great  deal  of  semantic  heterogeneity,  since  it  becomes 
an  important  issue  to  resolve  whether  each  designer  means  the  same  thing 
when  an  attribute  is  called  UNIT,  for  instance. 

C.  ATTEMPTING  TO  QUERY  THE  TOTAL  BODY  OF  DATA 

Information  on  the  population  of  interest,  Naval  Commissioned  Officers, 
is  contained  across  all  three  databases.  Frequently,  queries  that  span  the 
three  databases  need  to  be  answered.  For  example,  we  may  like  to  retrieve  all 
available  data  for  a  given  value  of  a  key  attribute,  such  as  Social  Security 
Number.  Obviously,  a  query  against  any  one  of  the  databases  cannot  ensure 
this.  Information  on  all  officers  may  not  exist  in  a  single  database.  For 
instance,  an  active  duty  officer  not  in  the  TAR  program  will  not  appear  in 


34 


IMAPMIS.  As  was  pointed  out,  different  attributes  representing  real  world 
items  of  interest  are  contained  in  different  databases. 

To  guarantee  no  loss  of  any  information  already  available,  we  must 
somehow  present  a  global  query  which  will  be  processed  against  a  global 
schema  that  represents  the  integration  of  the  three  databases,  and  return  the 
requested  information.  Even  when  this  is  accomplished,  the  further  problem 
of  conflicting  data  remains.  In  other  words,  due  to  differences  in  update 
times,  data  entry  errors,  etc.,  even  identical  attributes  for  the  same  officer 
may  contain  different  data  values. 

Therefore,  because  of  the  different  data  organizations,  naming  conven- 
tions, and  particular  information  available  in  each  database,  as  well  as  the 
situation  where  conflicting  data  represents  the  same  information,  there  must 
be  some  means  of  resolving  the  inevitable  semantic  conflicts  which  will  arise 
when  particular  attributes  are  returned. 

D.     INTEGRATION  STRATEGY 

To  allow  queries  that  span  several  databases,  a  federated  database 
approach  is  suggested.  Following  this  approach,  each  local  database  is  con- 
sidered a  logical  component  in  the  federation.  These  components  are  tied 
together  by  a  global  schema  that  represents  the  integration  of  the  local 
schemas.  To  accomplish  this  several  steps  are  necessary.  First,  each  local 
schema  is  transformed  into  an  equivalent  schema  in  a  semantically  rich 
common  data  model.  This  step  is  carried  out  in  the  following  sections  using 
the  Extended  Entity  Relation  (EER)  model,  applying  the  concepts  and  dia- 
grammatic conventions  covered  in  Chapter  III.  Second,  a  systematic  com- 
parison is   made   across   the   individual   equivalent   schemas   between 
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corresponding  entities,  and  attributes,  searching  for  potential  conflicts.  Third, 
after  resolving  semantic  conflicts,  the  local  schemas  in  the  common  data 
models  are  merged  to  form  a  global  schema.  Fourth,  an  additional  control 
component,  known  as  the  global  controller,  is  required.  The  global  controller 
maintains  the  definition  of  the  global  schema  and  acts  as  a  coordinator  and 
translator:  it  receives  a  global  query,  possibly  in  a  user  specific  language; 
translates  it  into  an  equivalent  query  on  a  common-model  global  schema; 
decomposes  and  translates  the  common-model  query  into  subqueries  to  the 
corresponding  local  database  sites  for  processing;  collects  the  results;  identi- 
fies and  resolves  data  content  conflicts;  reformats  the  result;  and  sends  it 
back  to  the  originating  site.  The  first  three  steps  of  this  process  are  covered  in 
detail  in  the  remainder  of  this  thesis.  The  theoretical  design  of  the  query  and 
resolution  components  of  the  global  controller  described  in  step  four,  above,  is 
related  to  the  levels  of  schematic  and  data  heterogeneity  covered  by  this 
analysis.  Chapter  VI  will  show  how  the  methods  of  semantic  conflict  resolu- 
tion developed  can  be  applied  to  the  design  of  the  global  controller.  The  spe- 
cific implementation  of  the  global  controller  deals  largely  with  the  levels  of 
DBMS  and  platform  heterogeneity  mentioned  earlier,  and  is  outside  the  scope 
of  this  study. 

Due  to  the  large  number  of  attributes  comprising  the  real  world  sample, 
this  analysis  extracts  a  representative  subset  of  attributes  from  each 
database.  This  subset  adequately  illustrates  the  methodology  employed. 
Similar  treatment  of  the  complete  ADMI,  OPINS,  and  IMAPMIS  schemas  would 
follow  identical  procedures. 

The  remainder  of  this  chapter  deals  with  transforming  the  ADMI,  OPINS 
and  IMAPMIS  schemas  into  equivalent  EER  schemas.  Chapter  V  uses  these 
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diagrams  to  identify  a  comprehensive  set  of  potential  semantic  conflicts 
among  equivalent  EER  schemas  using  examples  from  the  three  databases. 
Chapter  VI  employs  this  classification  framework  to  suggest  potential  solu- 
tions for  each  type  of  conflict  and  complete  the  realization  of  a  comprehensive 
global  schema. 

1.      Translation  of  ADMI  Into  EER  Form 

Deriving  an  EER  diagram  from  the  ADMI  database  was  begun  by 
selecting  an  appropriate  subset  of  attributes  from  the  total  which  comprise 
each  ADMI  record.  The  specific  attributes  were  chosen  to  ensure  that  similar 
information  was  analyzed  from  each  database,  as  well  as  to  realistically  show 
the  differences  in  domains  of  interest.  Once  the  set  of  data  elements  was 
determined,  they  were  grouped  as  attributes  of  a  logical  arrangement  of  real 
world  entities.  These  entities  were  then  related  based  on  a  reasonable 
interpretation  of  the  conceptual  view  which  ADMI  is  attempting  to  represent. 

Since  ADMI  is  a  flat  file,  all  data  elements  it  contains  can  in  some 
sense  be  considered  simple  attributes  of  a  single  entity.  However,  certain 
analytical  standards  are  applicable.  The  repeating  set  of  fields  used  to  repre- 
sent LANGUAGE,  for  instance,  clearly  represents  a  multi-valued  composite 
attribute  which  is  appropriately  diagrammed  as  a  separate  entity.  Since  the 
ADMI  database  contains  information  on  all  active  duty  personnel,  fields  which 
take  on  different  values  depending  on  officer/enlisted  status,  and  specific  ser- 
vice membership,  can  be  diagrammed  as  defining  attributes  of 
Generalization/  Specialization  relationships.  This  is  how  the  relationship  of 
active  duty  member  to  service  member  to  specific  service  officer  is  modeled. 
The  shaded  entities  for  other  service  member,  and  naval  enlisted,  are 
included  in  the  diagram  only  to  indicate  the  structure  of  the  relationship,  and 
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are  not  populated  with  the  describing  attributes  they  would  possess  in  a 
complete  representation.  In  the  actual  implementation  of  ADMI,  there  would 
not  be  a  separate  instance  of  the  UNIT  entity,  since  it  is  merely  a  set  of 
attributes  of  the  member  record.  In  reverse  engineering  from  a  flat  file 
database  to  an  EER,  however,  it  is  proper  to  represent  UNIT  as  an  entity, 
having  existence  independent  of  its  relation  to  a  particular  member.  In  this 
way,  the  most  general  level  of  conceptualization  is  achieved.  This  is  analo- 
gous to  the  convention  which  would  be  followed  in  modeling  the  real  world 
top  down  to  an  EER  schema.  The  particular  relation  of  unit  and  member  in 
the  actual  ADMI  is  only  an  artifact  of  a  given  implementation  decision. 

Obviously,  some  of  the  results  of  the  flat  file  to  EER  translation 
shown  below  are  based  on  arbitrary  assumptions,  and  may  be  open  to  chal- 
lenge. The  process  detailed  here  is  representative  of  what  would  be  done  in  a 
more  rigorous  manner  if,  for  instance,  the  multidatabase  designer  had  access 
to  information  on  the  intentions  of  the  designers  of  the  original  database.  At 
the  conceptual  level  of  this  treatment,  the  effort  is  to  illustrate  the  procedure, 
and  ensure  that  all  the  various  potential  conflicts  are  enumerated.  While 
detailed  translation  of  the  ADMI  might  result  in  a  slightly  different  EER  dia- 
gram, it  is  not  felt  that  any  undue  artificiality  has  been  introduced  into  the 
example. 

The  entity  structure  extracted  from  the  ADMI  database  is  presented 
in  Appendix  D.  The  completed  EER  diagram  of  the  extracted  attribute  subset 
is  shown  in  Figure  14. 
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Figure  14.  EER  Schema  for  the  Active  Duty  Military  Inventory 
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2.      Translation  of  OPINS  Into  EER  Form 

An  identical  translation  process  was  performed  on  the  OPINS  flat 
file  database.  The  extracted  subset  in  this  case  resulted  in  a  substantially  dif- 
ferent EER  diagram,  though  very  similar  attributes  were  utilized.  This  points 
out  the  semantic  differences  which  arise  in  each  designer's  representation  of 
the  real  world.  The  entity  UNIT,  for  instance,  is  derived  from  a  repeating  com- 
posite attribute  in  the  OPINS  record,  and  is  diagrammed  as  a  separate  entity 
having  a  one-to-many  relationship  with  COMMISSIONED_OFFICER.  This  is  dif- 
ferent from  the  relationship  between  MEMBER  and  UNIT  in  the  ADMI  example, 
because  OPINS  actually  captures  a  historical  record  of  unit  assignments,  vice 
simply  the  current  one.  Likewise,  the  entity  YEAR_GROUP  has  no  matching 
construct  in  ADMI,  since  this  represents  information  of  interest  solely  to  the 
designers  of  OPINS. 

Similar  caveat  is  offered  regarding  the  exact  process  of  translation 
for  OPINS  as  was  true  for  ADMI.  No  claim  is  made  for  the  fidelity  of  the  EER 
diagram  as  translated,  relative  to  the  actual  real  world  view  intended  by  the 
OPINS  designers.  However  the  results  given  here  are  representative  of  the  use 
of  the  EER  process  and  model  to  formulate  a  bottom  up  conceptual  schema 
from  an  existing  database. 

The  entity  structure  extracted  from  the  OPINS  database  is  pre- 
sented in  Appendix  E.  The  completed  EER  diagram  for  OPINS  attribute  subset 
is  shown  as  Figure  15. 
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Figure  15.  EER  Schema  for  the  Officer  Personnel  Information  System 
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3.      Translation  of  IMAPMIS  Into  EER  Form 

Unlike  the  potentially  arbitrary  assumptions  required  in  translat- 
ing the  ADMI  and  OPINS  flat  files  to  EER  form,  the  conversion  of  IMAPMIS  is 
more  straightforward.  Since  IMAPMIS  is  a  relational  database,  in  most  cases 
there  is  a  simple  correspondence  between  the  IMAPMIS  tables  as  defined,  and 
the  entities  modeled.  Some  entities,  such  as  LANG,  are  not  specified  uniquely 
as  separate  tables  by  the  IMAPMIS  specifications,  though  they  are  referred  to 
as  individual  record  types.  Relationships  for  the  IMAPMIS  EER  diagram  are 
easily  derived  from  the  location  of  foreign  keys  within  the  tables. 

The  entity  structure  extracted  from  the  IMAPMIS  database  is  pre- 
sented in  Appendix  F.  The  EER  diagram  for  the  IMAPMIS  subset  as  translated 
is  shown  as  Figure  16.  The  shaded  entity  for  enlisted  member  is  included  as  a 
place  holder  only  to  indicate  the  structure  of  the  relationship,  and  is  not 
populated  with  the  describing  attributes  it  would  possess  in  a  complete 
representation. 
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Figure  16.    EER  Schema  for  the  Inactive  Manpower  and  Personnel 
Management  Information  System 
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V.  A  FRAMEWORK  FOR  SEMANTIC  HETEROGENEITY 

A.     CLASSIFYING  SEMANTIC  CONFLICTS 

With  the  candidate  databases  transformed  into  equivalent  EER  schemas, 
potential  semantic  conflicts  can  be  identified.  To  facilitate  the  classification 
and  resolution  of  semantic  conflicts,  a  framework  for  identifying  such  con- 
flicts is  developed  in  this  chapter.  The  classification  framework  presented 
here  recognizes  two  broad  kinds  of  conflict.  Schematic  conflict,  which  occurs 
at  the  level  of  the  conceptual  organization  and  definition  of  the  database,  and 
data  level  conflicts,  which  occur  between  the  actual  data  values  returned 
from  the  different  databases  by  a  query  against  the  global  schema. 

Procedurally,  the  individual  EER  schemas  are  matched  against  each 
other  in  a  top-down  fashion,  and  conflicts  as  they  are  noted  are  assigned  to 
sub-categories  of  the  schematic  division.  When  all  possible  schematic  conflicts 
have  been  classified,  a  more  speculative  analysis  of  possible  data-level  con- 
flicts is  conducted,  to  determine  potential  problems.  The  remainder  of  this 
chapter  presents  the  classification  framework  using  examples  resulting  from 
the  analysis  of  the  ADMI,  OPINS,  and  IMAPMIS  databases.  References  to  the 
assumed  detailed  data  definitions,  which  are  provided  for  extracted  attribute 
subsets  in  Appendices  G  through  I  for  ADMI,  OPINS  and  IMAPMIS,  respectively, 
are  intended  to  be  complete  enough  so  that  immediate  cross  checking  is  not 
required.  Analysis  of  the  appendices  will  reveal  many  potential  conflicts  not 
explicitly  shown  below.  Chapter  VI  offers  some  potential  resolution  strategies 
for  resolving  the  semantic  conflicts  between  the  candidate  databases,  and 
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completes  the  integration  of  a  global  EER  schema  which  would  be  used  to 
guide  the  formulation  of  queries  against  the  complete  body  of  data. 
1.      Schematic  Level  Conflicts 

As  mentioned  above,  this  type  of  conflict  arises  from  the  conceptual 
arrangement  and  definition  of  the  databases.  Since  all  three  databases  have 
been  represented  in  an  equivalent  EER  form,  the  process  of  identifying  these 
disparities  is  simplified.  Top-down  analysis  of  the  individual  database 
schemas  yields  three  subcategories  of  schema  level  conflict:  entity  conflicts; 
attribute  conflicts;  and  entity- attribute  conflicts.  Entity  level  conflicts  occur 
between  equivalent  entities.  Attribute  level  conflicts  specify  discrepancies 
among  like  attributes.  Entity- Attribute  level  conflicts  concern  differing  orga- 
nization of  data,  such  as  representing  the  same  information  as  an  attribute  in 
one  case,  and  as  an  entity  in  another. 

Each  subcategory  will  be  detailed  in  order,  with  examples  from  the 
three  databases  under  discussion. 

a.     Entity  Level  Conflicts 

Entity  level  conflicts  occur  when  like  real  world  entities  have 
differing  names  (synonyms),  or  differing  entities  have  identical  names 
(homonyms).  Entity  structures  as  represented  by  the  database  schema  may 
also  conflict.  A  third  entity  level  conflict  occurs  when  relationship  constraints 
between  entities  differs  across  two  or  more  schemas. 

(1)  Naming  Conflicts.  An  example  of  a  synonym  problem  is 
the  entity  COMMISSIONED.OFFICER,  in  the  OPINS  database,  contrasted  with 
the  equivalent  MEMBER-OFR  in  IMAPMIS.  Both  refer  to  instances  of  a  particu- 
lar commissioned  Naval  officer,  but  in  an  integrated  schema,  a  single  entity 
name  must  be  specified.  Similarly,  IMAPMIS  names  a  given  course  of  college 
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education  for  a  given  officer  EDUC,  but  OPINS  names  the  same  entity 
EDUCATION.  Relatively  obvious  dissimilarities  such  as  this  are  simple  to 
resolve,  but  all  kinds  of  complex  synonym  conflicts  can  occur  in  real  world 
cases. 

Homonyms  are  a  more  serious  problem,  because  in  this 
case,  different  real  world  entities  are  given  the  same  name.  Identification  of 
homonym  conflicts  requires  more  detailed  dissection  of  each  entity,  to  deter- 
mine its  actual  meaning.  When  completely  differing  concepts  are  captured  by 
like-named  entities,  this  must  be  rigorously  checked,  since  an  uncritical  map- 
ping of  the  two  into  a  single  entity  in  the  global  schema  will  give  a  meaning- 
less result.  An  example  of  homonyms  is  apparent  when  the  UNIT  entity  from 
ADMI  is  compared  with  UNIT  in  OPINS.  In  the  ADMI  database,  UNIT  refers  to 
an  instance  of  a  military  activity,  such  as  a  ship  or  squadron.  In  OPINS,  how- 
ever the  same  name  is  given  to  an  entity  which  is  actually  an  officer's 
assignment  to  a  given  billet,  at  a  given  unit.  It  is  obvious  that  even  though 
identical  names  are  assigned  to  these  entities,  a  very  different  semantic  con- 
tent is  represented  in  the  two. 

(2)  Entity  Structure  Conflicts.  This  is  caused  by  overlapping 
or  incomplete  attribute  sets  for  equivalent  entities.  This  can  arise  due  to  fail- 
ure of  one  database  to  include  certain  attributes  captured  by  another  because 
it  was  not  considered  of  interest.  Information  concerning  an  entity  might  also 
be  represented  by  the  attributes  of  other  entities  in  a  Generalization/Speciali- 
zation relationship. 

An  example  of  missing  attributes  is  found  in  UNIT  which 
in  the  OPINS  database  does  not  include  an  attribute  for  the  unit  Zip  code, 
while  UNIT  in  the  ADMI  does.  The  designers  of  the  OPINS  did  not  choose  to 


46 


store  this  particular  information.  Likewise,  LANGUAGE  in  the  ADMI  has  an 
attribute  for  listening  proficiency,  while  LANGUAGE_SKILL  in  OPINS  contains 
an  attribute  for  writing  skill.  Again,  this  results  from  differing  areas  of 
informational  concern  when  the  original  databases  were  designed. 

Overlapping  attributes  are  found  in  MEMBER-OFR 
(IMAPMIS),  which  does  not  contain  the  member's  name,  contrasted  with 
COMMISSIONED.OFFICER  (OPINS),  which  does.  This  is  due  to  the  General- 
ization/Specialization relationship  of  MEMBER-OFR  to  MEMBER  in  IMAPMIS. 
The  member's  name  is  represented  by  an  attribute  of  MEMBER.  Thus  the 
same  information  is  present,  but  at  a  different  level  of  the  schema.  Since 
OPINS  captures  information  on  a  more  limited  population  than  IMAPMIS,  the 
attributes  are  arranged  in  a  different  manner. 

(3)  Constraint  Conflicts.  When  the  cardinality  of  relationship 
between  two  entities  varies  across  two  or  more  schemas,  it  is  termed  an 
entity  constraint  conflict.  This  is  shown  by  the  n-to-1  relation  between  UNIT 
and  ACTIVE_DUTY_MILITARY_MEMBER  in  the  ADMI,  as  opposed  to  the  1-to-n 
relation  between  COMMISSIONED.OFFICER  and  UNIT  in  the  OPINS.  If  the 
structure  of  the  entities  manifesting  a  constraint  conflict  is  indeed  similar, 
this  again  indicates  a  basic  semantic  conflict  regarding  just  what  the 
databases  are  attempting  to  represent.  It  will  be  shown  that  in  this  particular 
instance,  the  constraint  conflict  actually  results  from  a  structure  conflict 
because  the  two  unit  entities  are  dissimilar.  However,  constraint  conflicts  are 
independently  a  valid  classification  of  semantic  heterogeneity. 

Another  type  of  entity  constraint  conflict  occurs  when 
there  is  a  difference  in  participation  requirements  for  equivalent  relation- 
ships in  two  databases.  An  example  of  this  is  given  by  the  partial 


47 


participation  of  MEMBER  in  the  'Has'  relationship  with  SCRTY-CLEAR  in  the 
IMAPMIS.  Contrast  this  with  the  total  participation  of  COMMISSIONED. 
OFFICER  in  the  'Certified  For'  relationship  with  SECURITY.REQUIREMENT  in 
the  OPINS.  This  type  of  conflict  arises  from  differing  views  of  the  informa- 
tional domain  by  two  groups  of  users.  Since  IMAPMIS  defines  the  relation- 
ship of  a  generalization  member  to  a  security  clearance  (e.g.,  all  members 
may  have  a  security  clearance),  the  participation  constraint  conflicts  with 
that  of  OPINS,  which  models  total  participation  (all  officers  must  be  certified 
for  one  and  only  one  security  requirement). 
b.      Attribute  Level  Conflicts 

Attribute  level  conflicts  cover  the  same  conceptual  range  as  the 
entity  level.  Attributes  representing  the  same  real  world  informational  ele- 
ment can  have  differing  names,  or  differing  attributes  identical  names. 
Attribute  structure  conflict  is  analogous  to  entity  structure  conflict.  Attribute 
constraint  conflict  differs  from  entity  constraints  since  it  is  due  not  to  rela- 
tionship cardinality  or  participation  constraints,  but  to  differences  in  the 
attribute  definition. 

(1)  Attribute  Name  Conflicts.  Like  entity  name  conflicts,  this 
category  comprises  synonyms  and  homonyms.  The  reasons  for  this  type  of 
conflict  are  the  same  as  for  the  entity  level.  Samples  of  attribute  synonyms 
from  the  databases  of  interest  are  DESIGNATOR  (OPINS)  and  DESIG 
(IMAPMIS),  as  well  as  ORIGINAL_SOURCE_CODE  (OPINS)  and  SOURCE_OF_ 
COMMISSION  (ADMI).  These  both  illustrate  identical  real  world  facts  called  by 
different  names. 

An  example  of  homonyms  is  UNIT_IDENTIFICATION_CODE 
(ADMI)  contrasted  to  UNIT_IDENTIFICATION_CODE  (OPINS).  These  two 
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identically  named  attributes  represent  different  real  world  facts.  The  ADMI 
captures  Department  of  Defense  wide  unit  identification,  while  the  same 
attribute  in  the  OPINS  is  actually  a  composite  attribute  made  up  of 
PARENT.UIC  and  ACTUAL.UIC  with  the  latter  attribute  corresponding  to 
UNIT_IDENTIFICATION_CODE . 

(2)  Attribute  Structure  Conflicts.  These  are  similar  to  entity 
structure  conflicts,  and  arise  from  information  being  represented  by  an 
atomic  attribute  in  one  database,  and  the  same  information  as  either  two 
separate  attributes,  or  part(s)  of  a  composite  attribute  in  another. 

Equivalent  information  is  captured  by  RACE_ETHNIC  in 
ADMI,  and  the  two  attributes  RACE  and  ETHNIC  in  the  OPINS.  This  case 
demonstrates  a  single  attribute  to  multiple  attribute  structure  conflict.  Alter- 
natively, the  real  world  value  of  an  officer's  warfare  designator  is  represented 
by  the  atomic  attribute  DESIG  in  the  OPINS,  while  the  IMAPMIS  database 
breaks  this  information  down  into  DESIG-CAT  and  DESIG-STAT,  which 
themselves  are  part  of  the  composite  attribute  DESIG. 

(3)  Attribute  Constraint  Conflicts.  Unlike  constraint  conflict 
at  the  entity  level,  attribute  constraint  conflict  occurs  due  to  the  detailed 
description  of  the  attribute  itself.  Thus  equivalent  real  world  facts  are  repre- 
sented by  attributes  which  have  different  data  definitions.  This  can  be  mani- 
fested as  type  clashes  (e.g.,  character  opposed  to  numeric),  length  clashes 
(e.g.,  larger  or  smaller  number  of  characters  in  a  given  field),  and  range 
clashes  (e.g.,  different  allowable  set  of  values  for  equivalent  facts).  Type  con- 
flict is  quite  common  when  dealing  with  databases  designed  for  different 
operational  implementations,  while  range  and  length  conflict  results  more 
from  semantic  design  choices. 
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An  example  of  type  and  length  clash  is  given  by  SOCIAL. 
SECURITY_NUMBER  (ADMI),  which  is  defined  as  a  4  byte  packed  integer, 
while  the  identical  information  is  defined  as  a  nine  numeric  integers  (which 
can  be  handled  as  a  string  by  modern  processing  techniques)  for  SSN  (OPINS). 
The  two  Year/Month/Day  attributes  DATE_OF_BIRTH  (ADMI)  and  DOB 
(IMAPMIS)  are  similarly  mismatched,  as  the  first  is  stored  as  a  3  byte  packed 
integer,  and  the  second  as  a  6  character  string. 

Allowable  value,  or  range,  clash,  is  also  illustrated  by  the 
two  date  attributes  just  noted.  In  the  IMAPMIS,  the  member's  date  of  birth  is 
defined  as  having  a  value  between  January  1,  1900  and  December  31,  1999. 
An  incompatible  range  is  defined  for  the  ADMI,  since  the  date  of  birth  in  this 
database  can  take  on  any  6  digit  value  which  corresponds  to  a  valid  date  (in 
other  words,  the  date  is  only  constrained  to  be  a  date,  and  could  represent  a 
value  outside  that  allowed  for  the  same  date  in  the  IMAPMIS). 
c.      Entity  Attribute  Conflicts 

Entity  attribute  conflicts  arise  when  equivalent  information  is 
represented  as  an  attribute  of  one  entity  in  a  given  database,  but  as  a  sepa- 
rate entity  in  another  database.  This  situation  arises,  like  other  structural 
semantic  conflicts,  because  of  conceptual  design  choices  concerning  the 
desired  organization  of  information.  A  particular  data  element  might  be  con- 
sidered to  be  part  of  the  aggregate  data  defining  an  entity  by  one  design 
team,  but  the  same  element(s)  might  be  considered  important  enough  to  set 
aside  as  an  independent  entity  by  another  team.  As  in  other  structural 
conflicts,  entity  attribute  conflicts  have  the  effect  of  placing  corresponding 
information  at  different  levels  of  the  schema. 
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An  example  of  this  is  the  member's  security  clearance  informa- 
tion, which  in  the  OPINS  and  IMAPMIS  databases  is  represented  as  separate 
entities;  SCRTY-CLEAR  in  the  IMAPMIS,  and  SECURITY.CLEARANCE  in  the 
OPINS.  The  equivalent  real  world  information  (though  less  detailed)  is  stored 
by  the  ADMI  as  the  composite  attribute  SECURITY_INVESTIGATION,  and  the 
atomic  attribute  SECURITY_CLASSIFICATION,  both  belonging  to  the  ACTIVE. 
DUTY_MILITARY_MEMBER  entity.  Different  views  of  the  real  world  bring 
about  these  differing  conceptual  arrangements  of  the  same  information. 

d.      Completed  Schematic  Level  Classification  Framework 

It  should  be  apparent  from  the  examples  give  above,  that 
multiple  simultaneous  conflicts  can  exist  at  any  level.  Entities  which  have 
synonym  conflicts  can  at  the  same  time  have  structural  and  constraint  mis- 
matches. Equivalent  attributes  are  often  subject  to  both  name,  and  structure/ 
constraint  conflicts.  The  value  of  the  classification  framework  presented  here 
is  that  it  provides  a  systematic  analytical  tool  for  the  identification  of  all 
schematic  conflicts. 

The  full  schematic  classification  portion  of  the  framework  is 
reiterated  in  Figure  17. 

2.      Data  Level  Conflicts 

The  full  enumeration  of  semantic  conflicts  must  also  account  for 
data  level  conflicts,  even  when  all  possible  schematic  conflicts  have  been 
identified  and  resolved.  This  is  because  even  identically  defined  and  named 
attributes  may  contain  actual  data  values  which  do  not  agree.  Data  level 
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1.  Entity  Level  Conflicts 
Naming  Conflicts 

Synonyms  (Same  real  world  entities  have  same  name  in  different  dBs.) 
Homonyms  (Different  real  world  entities  have  same  name  in  different  dBs.) 

Structure  Conflicts 

Different  attribute  sets 
Missing  attributes 
Overlapping  attributes 

Relationship  Constraint  conflicts 

2.  Attribute  Level  Conflicts 
Naming  Conflicts 

Synonyms  (Same  real  world  entities  have  same  name  in  different  dBs.) 
Homonyms  (Different  real  world  entities  have  same  name  in  different  dBs.) 

Constraint  Conflicts 

Type  clash.  (Equivalent  real  world  attributes  have  different  data  type 
definitions  in  different  dBs.) 

Ranee  clash.  (Equivalent  real  world  attributes  of  the  same  type  data 
have  different  allowable  range  definitions  in  different  dBs.) 

Structure  Conflicts 

(Equivalent  real  world  information  is  represented  as  a  single  attribute 
in  one  dB,  and  as  either  two  separate,  or  partes)  of  a  composite 
attribute  in  another.) 

3.  Entity  Attribute  Level  Conflicts 

(Equivalent  information  is  represented  as  an  attribute  of  an  entity  in 
one  dB,  and  as  either  a  separate  entity,  or  attribute(s)  of  a  Generalization/ 
Specialization  entity  structure  in  another.) 


Figure  17.  Framework  of  Conceptual  Schema  Level  Heterogeneity 

conflict  can  be  broken  down  into  two  main  types;  inconsistencies,  and  repre- 
sentation conflicts.  Inconsistencies  refer  to  the  case  where  two  equivalent 
values  for  an  identical  instance,  such  as  a  date,  or  rank,  do  not  agree  when 
the  results  of  a  query  are  returned  from  two  or  more  databases.  Data 
representation  conflicts  cover  a  much  more  diverse  spectrum  of  possible 
conflicts,  arising  from  dissimilar  expressions,  dissimilar  units,  and  dissimilar 
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precisions.  Incorporating  these  potential  data  conflicts  into  the  classification 
framework  completes  this  chapter,  and  results  in  a  valuable  methodological 
tool  for  complete  identification  of  semantic  heterogeneity. 
a.     Inconsistencies 

Inconsistencies  are  easily  conceptualized,  and  unfortunately 
very  common,  semantic  conflicts.  They  arise  from  the  real  world  process  of 
creating,  updating  and  maintaining  databases.  Different  update  times, 
human  data-entry  errors,  or  incorrect  data  submitted  to  be  stored  can  all 
produce  inconsistency.  An  inconsistency  results  when  one  database  returns  a 
given  value  for  a  specific  real  world  element  of  interest,  and  another  database 
returns  a  different  value  for  the  same  element.  This  conflict  is  independent  of 
any  schematic  naming  or  other  conflict.  While  simply  understood,  and  easily 
identified,  inconsistency  is  the  most  difficult  conflict  to  resolve.  Often  there  is 
simply  no  other  method  available  to  reconcile  an  inconsistency  except  to  go 
back  to  the  original  source  of  the  data  value,  and  determine  which  (if  any)  of 
the  conflicting  values  are  correct.  There  are  other  potential  ways  to  approach 
the  resolution  problem,  which  will  be  addressed  in  Chapter  VI,  but  none 
which  are  guaranteed  to  provide  a  general  solution. 

A  simple,  and  obvious,  example  of  an  inconsistency  is  the  ADMI 
database  returning  a  PAY_GRADE  of  4,  corresponding  to  04,  or  Lieutenant 
Commander,  for  a  given  commissioned  officer,  while  the  OPINS  returns  a 
value  of  3  for  the  attribute  PRESENT_GRADE,  indicating  a  rank  of  Lieutenant. 
One  of  the  two  is  incorrect,  since  an  officer  only  holds  one  rank  in  the  real 
world.  Techniques  for  determining  which  value  to  use  will  be  presented  in 
Chapter  VI. 
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b.      Data  Representation  Conflicts 

Data  representation  conflicts  occur  when  incompatible  sym- 
bols, units  of  measurement,  or  degrees  of  precision  are  used  to  store  equiva- 
lent data  elements.  In  general,  this  is  due  to  design  choices  at  the  conceptual 
level  caused  by  differing  areas  of  interest,  or  levels  of  concern,  about  given 
real  world  information  on  the  part  of  the  database  designers.  One  organiza- 
tion may  wish  to  have  very  specific  and  precise  information  about  an 
attribute  of  interest,  while  another  organization  might  be  satisfied  with  a 
general  categorization  of  the  same  data.  Alternatively,  one  design  team  may 
be  accustomed  to  dealing  with  coded  references  to  external  look  up  tables  to 
represent  values,  while  another  set  of  designers  prefer  to  more  explicitly 
represent  values  with  characters.  The  physical  implementation  details  of  the 
hardware  in  use,  and  the  individual  processing  procedures  of  the  DBMS  also 
influence  the  occurrence  of  data  representation  conflicts. 

(1)  Dissimilar  Expressions.  Dissimilar  expression  conflicts 
come  about  when  two  or  more  databases  use  the  same  type  of  data,  but  the 
values  stored  in  the  attribute  have  different  meanings.  For  example,  equiva- 
lent information  might  be  represented  by  different  character  strings.  An 
instance  of  this  is  ACTIVITY_TITLE,  a  character  attribute  which  in  OPINS 
represents  the  UNIT's  text  name,  such  as  COMSURFRON  THREE'.  Contrast  this 
to  the  attribute  ACTY-LANG-NAME,  also  a  character  attribute,  which  IMAPMIS 
uses  for  the  same  information.  The  actual  string  stored  in  this  attribute  for 
the  equivalent  unit  might  be  CMDR,  SFC  SQDRN  3".  Thus  given  character 
strings  returned  from  the  two  databases,  may  or  may  not  have  the  same 
meaning. 
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(2)  Dissimilar  Units.  Dissimilar  unit  conflicts  are  caused  by 
the  storage  of  information,  particularly  absolute  or  relative  measurements,  in 
attributes  with  the  same  type,  and  length,  and  range,  but  with  allowable 
values  defined  in  different  units.  In  the  analysis  of  the  administrative  ADMI, 
IMAPMIS,  and  OPENS  personnel  databases,  examples  of  this  particular  type  of 
conflict  are  rare,  since  few  measurements  are  maintained.  One  illustration  is 
the  UNIV-DUR  attribute,  in  the  IMAPMIS  database,  which  represents  a  2  char- 
acter value  for  the  length  of  an  officer's  course  of  instruction  in  weeks.  This 
choice  of  units  comes  about  through  a  domain  analysis  which  indicates  that 
the  population  of  interest  (Naval  Reserve  commissioned  officer  personnel)  are 
likely  to  take  shorter  courses  as  opposed  to  longer  courses  pursued  by  active 
duty  personnel.  On  the  other  hand,  the  DURATION  attribute  in  the  OPINS  is 
also  two  characters  (although  stored  as  numeric  integers),  but  represents  the 
length  of  a  course  of  instruction  in  months.  If  an  attempt  is  made  to  match 
these  two  values,  a  dissimilar  units  conflict  will  occur.  The  value  20,  returned 
from  both,  would  mean  both  20  weeks,  and  20  months,  respectively. 

(3)  Dissimilar  Precisions.  This  type  of  data  level  conflict  is 
due  to  real  world  information  being  specified  at  the  attribute  level  in  different 
degrees  of  precision.  In  other  words,  the  same  value  returned  from  two  or 
more  databases  has  a  different  meaning  because  an  identical  range  is  subdi- 
vided with  different  levels  of  granularity.  Consider  READING.PROFICIENCY 
from  the  ADMI  database.  This  1  character  attribute  is  constrained  to  the 
numeral  values  of  zero  through  nine,  with  nine  being  defined  as  fluent,  and 
zero  as  unacceptable,  with  eight  gradations  completing  the  allowable  values. 
This  provides  the  DMDC  very  precise  information  on  the  foreign  language 
reading  ability  of  personnel  in  the  database.  The  OPINS  definition  for 
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SKILL_READ,  however,  while  it  is  also  a  1  character  attribute,  groups  the 
allowable  ten  numeral  range  into  four  sub  ranges,  from  poor,  to  outstanding. 
Obviously,  although  the  two  attributes  store  equivalent  information  in  identi- 
cal formats,  the  values  from  OPINS  cannot  be  considered  to  give  an  identical 
level  of  detail  as  those  from  ADMI,  since  within  sub  ranges  any  value  will  re- 
sult in  one  of  the  four  broad  categories  being  returned  as  a  result  of  a  query. 
c.      Completed  Data  Level  Classification  Framework 

The  proceeding  data  level  conflicts  will  not  all  become 
apparent  in  the  process  of  integrating  a  multidatabase  from  a  set  of 
heterogeneous  databases.  Dissimilar  expressions  and  dissimilar  precisions 
may  or  may  not  be  identified,  depending  on  the  depth  of  description  available 
to  the  integration  effort  in  terms  of  detailed  data  definitions.  The  actual 
attribute  definitions  for  the  three  candidate  databases  were  not  considered  in 
this  study,  and  the  assigned  data  definitions  have  been  designed  to  illustrate 
each  of  the  possible  conflict  types.  This  is  representative  of  the  level  of 
analysis  required  to  identify  the  full  range  of  semantic  conflicts. 

Unfortunately,  data  inconsistencies  will  almost  certainly  not 
become  obvious,  until  data  from  global  queries  is  returned.  No  level  of  purely 
conceptual  analysis  will  be  able  to  preclude  wrong  data,  mismatched  update 
times,  or  data  entry  error.  Inconsistencies  are  included  in  the  framework 
because  they  represent  one  very  important  type  of  semantic  conflict,  albeit 
one  not  resolvable  by  the  conceptual  integration  effort. 

The  complete  data  level  classification  portion  of  the  framework 
is  reiterated  in  Figure  18. 
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1.     Inconsistencies 

(Equivalent  information  returned  from  different  dBs  disagrees  in  value.) 

2.     Data  representation  conflicts 

Dissimilar  Expressions 

(Equivalent  information  returned  from  different  dBs  is 

incompatible  values.) 

represented  by 

Dissimilar  Units 

(Equivalent  information  returned  from  different  dBs  is 
different  units.) 

expressed  in 

Dissimilar  Precisions 

(Equivalent  information  returned  from  different  dBs  is 
degrees  of  accuracy.) 

given  to  different 

Figure  18.  Framework  of  Data  Level  Heterogeneity 

B.     THE  SEMANTIC  HETEROGENEITY  FRAMEWORK 

The  fully  realized  framework  for  classifying  semantic  heterogeneity  can 
now  be  applied  to  any  set  of  existing  databases  which  have  been  transformed 
into  equivalent  EER  schemas.  By  using  a  systematic  approach  to  analysis  of 
each  equivalent  set  of  entities,  attributes,  and  relations,  all  possible  semantic 
conflicts  will  be  identified.  Of  course,  for  the  useful  integration  of  a  set  of 
heterogeneous  databases  into  a  global  schema,  these  conflicts  must  somehow 
be  resolved.  Chapter  VI  addresses  this  issue  in  a  general  way,  offering  some 
possible  solutions  for  each  category  of  semantic  conflict.  Applying  these 
methods  of  resolution,  the  three  administrative  databases  under  con- 
sideration will  be  integrated  into  a  coherent,  globally  addressable  EER 
schema.  The  specific  instrumentalities  of  resolving  each  type  of  conflict,  as 
well  as  a  rigorous  analysis  of  general  solutions,  is  left  to  future  research. 

The  complete  framework  for  semantic  heterogeneity  is  shown  below  as 
Figure  19. 
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Schematic  Level 

1.  Entity  level  conflicts 

Naming  conflicts 

Synonyms  (Same  real  world  entity  has  different  names  in  different  dBs.) 
Homonyms  (Different  real  world  entities  have  same  name  in  different  dBs.) 

Structure  conflicts 

Different  attribute  sets 
Missing  attributes 
Overlapping  attributes 

Relationship  Constraint  conflicts 

2.  Attribute  Level  Conflicts 

Naming  conflicts 

Synonyms  (Same  real  world  attribute  has  different  names  in  different  dBs.) 
Homonyms  (Different  real  world  attributes  have  same  name  in  different  dBs.) 

Constraint  conflicts 

Type  clash.  (Equivalent  real  world  attributes  have  different  data  type  definitions 
in  different  dBs.) 

Range  clash.  (Equivalent  real  world  attributes  of  the  same  data  type  have  different 
allowable  range  definitions  in  different  dBs.) 

Structure  conflicts 

(Equivalent  real  world  information  is  represented  as  a  single  attribute 
in  one  dB,  and  as  either  two  separate,  or  partes)  of  a  composite 
attribute  in  another.) 

3.  Entity  Attribute  Level  Conflicts 

(Equivalent  information  is  represented  as  an  attribute  of  an  entity  in 
one  dB,  and  as  either  a  separate  entity,  or  attribute(s)  of  a 
Generalization/Specialization  entity  structure  in  another.) 


Figure  19.  Complete  framework  for  Semantic  Heterogeneity 
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Data  Level 

1. 

Inconsistencies 

(Equivalent  information  returned  from  different  dBs  disagrees  in  value.) 

2. 

Data  representation  conflicts 

Dissimilar  Expressions 

(Equivalent  information  returned  from  different  dBs  is 

incompatible  values.) 

represented  by 

Dissimilar  Units 

(Equivalent  information  returned  from  different  dBs  is 
different  units.) 

expressed  in 

Dissimilar  Precisions 

(Equivalent  information  returned  from  different  dBs  is 
degrees  of  accuracy.) 

given  to  different 

Figure  19.  Complete  framework  for  Semantic  Heterogeneity  (Concluded) 
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VI.  SOLUTIONS  FOR  RESOLVING  SEMANTIC  HETEROGENEITY 

A.  GENERAL  APPROACH 

With  the  candidate  databases  for  integration  in  equivalent  schemas,  and 
all  potential  sources  of  semantic  conflict  identified  using  the  framework  pre- 
sented in  the  preceding  chapter  the  final  step  can  be  completed.  This  is  to 
consolidate  them  into  a  single  global  schema  which  can  be  used  to  guide  the 
formulation  of  queries  against  the  total  set  of  available  data.  Additionally, 
internal  design  considerations  of  the  global  controller  component  which  actu- 
ally manipulates  the  federated  database  are  developed  during  this  stage  of 
the  integration  process.  It  is  during  this  phase  that  conflicts  are  resolved, 
while  not  losing  any  information. 

The  spectrum  of  possible  solutions  to  identified  semantic  conflicts  ranges 
from  complete  redesign  of  a  new  integrated  database,  to  maintaining  the  sep- 
arate databases,  under  some  query  scheme  which  allows  them  to  be 
addressed  as  one.  This  federated  database  approach  was  described  in  Chapter 
IV,  and  this  chapter  presents  in  the  federated  database  context  some  very 
general  rules  which  can  be  used  to  resolve  the  conflicts  noted  in  Chapter  V. 
These  rules  apply  both  at  the  level  of  schema  integration  and  data  conflict 
resolution.  The  question  of  verifiably  correct  solutions  to  the  various  types  of 
semantic  conflict  is  a  rich  field  of  future  research  on  integrating  heteroge- 
neous databases. 

The  resolution  strategy  presented  here  proceeds  in  two  parts,  forming 
the  global  schema,  and  dealing  with  data  conflicts  which  are  returned  against 
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queries.  First,  schema  conflicts  between  the  local  schemas  in  the  common 
data  models  are  resolved,  allowing  them  to  be  merged  to  form  a  global 
schema.  This  guides  the  user  in  formulating  queries  against  the  total  body  of 
data.  Suggested  methods  for  choosing  the  structure  of  the  global  schema  are 
offered  below  for  each  type  of  conflict.  Design  of  the  global  controller  compo- 
nent is  guided  by  these  choices  in  the  processing  of  queries.  Second,  the  global 
controller  is  provided  with  the  complete  definition  of  the  global  schema, 
including  appropriate  means  of  mapping  from  the  global  schema  to  the  com- 
ponent databases,  as  well  as  the  information  needed  to  translate,  compare, 
and  resolve  the  various  data  conflicts  which  will  arise  when  data  is  returned 
from  a  global  query. 

The  design  of  the  global  controller  is  influenced  by  the  understanding  of 
semantic  conflicts  gained  during  the  re-engineering  process.  This  component 
deals  with  semantic  conflicts  during  query  processing  and  retrieval,  as  well 
as  resolving  data  level  conflicts  which  occur  when  inconsistent  data  is 
returned  for  the  same  real  world  item  of  interest  by  the  component 
databases.  During  querying  and  retrieval,  the  controller  must  know  how  to 
map  from  the  entity  and  attribute  names  chosen  for  the  global  schema  back 
to  the  actual  names  used  in  the  component  databases.  When  data  is  returned, 
the  controller  must  have  means  to  translate  various  attribute  definitions  into 
a  common  form,  compare  their  values,  and  if  possible,  resolve  data  level  con- 
flicts before  presenting  the  information  to  the  user.  In  both  these  aspects,  the 
re-engineering  analyst  uses  detailed  knowledge  of  the  semantic  conflicts 
existing  among  the  component  databases  gained  through  the  process 
described  in  this  thesis. 
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The  basic  assumption  of  this  chapter  is  that  all  available  information  is 
to  be  captured  in  the  global  schema.  In  other  words,  no  attributes  from  any 
database  are  to  be  excluded  if  they  provide  data  not  represented  elsewhere  by 
equivalent  attributes.  Where  data  is  duplicated,  the  rules  presented  below 
guide  the  choice  of  alternatives  for  inclusion  in  the  global  schema.  Thus  the 
union  of  attribute  sets  from  equivalent  entities  is  most  often  suggested,  which 
ensures  that  missing  attributes  from  any  one  database  are  not  lost. 
Heuristics  that  identify  which  of  several  redundant  overlapping  attributes 
may  be  safely  disregarded  complete  this  part  of  the  resolution  process. 

Another  underlying  assumption  is  that  data  included  in  the  global 
schema  should  be  represented  in  the  highest  level  of  definition  or  precision 
available.  Therefore  when  several  attributes  capture  equivalent  information, 
the  most  precisely  defined,  or  that  which  specifies  the  highest  available 
degree  of  precision  is  chosen  over  redundant  alternatives. 

A  final  general  comment  on  resolving  semantic  conflicts  is  that  in  many 
cases,  there  will  simply  be  no  other  choice  than  to  go  back  to  the  user.  This  is 
particularly  true  in  the  case  of  data  inconsistencies  as  will  be  noted  below. 
Re-examination  of  the  real  world  data  set  might  also  be  required  to  resolve 
cases  of  wrong  data,  though  there  are  rules  of  thumb  which  can  be  applied 
with  some  risk  of  error. 

The  following  section  restates  the  specific  semantic  conflicts,  by  type, 
which  were  used  as  examples  in  the  preceding  chapter.  Proposed  solutions  for 
each  type  of  conflict  are  offered,  with  estimates  of  effectiveness,  practicality, 
and  certainty  of  correct  resolution  where  appropriate.  The  completed  global 
schema  for  the  three  officer  personnel  databases  is  presented  at  the  end  of 
the  section  on  schematic  conflict.  This  is  followed  by  a  section  dealing  with 
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data  level  conflicts,  with  some  considerations  for  the  design  of  a  global  con- 
troller component  for  a  federated  database  application  approach  to  integrat- 
ing them. 

B.     PROPOSED  SOLUTIONS 

The  following  examples  duplicate,  for  consistency,  the  conflicts  by  type 
which  were  identified  and  classified  in  Chapter  V. 

1.      Schematic  Level  Conflicts 

Solutions  to  schematic  level  conflicts  generally  involves  renaming, 
combining,  or  redefining  entities  and  attributes  in  a  practical  way  to  ensure 
the  preservation  of  all  originally  available  semantic  content.  The  global  con- 
troller uses  name  mapping  and  look-up  tables  to  allow  decomposition  of 
queries  against  the  entity  or  attribute  name  chosen  for  the  global  schema 
back  to  the  component  databases.  With  the  possible  exception  of  constraint 
conflicts,  the  integrating  designer  having  a  clear  understanding  of  the  prob- 
lem domain  does  not  need  frequent  recourse  to  the  user  in  resolving  this  level 
of  conflict. 

a.     Entity  Level  Conflicts 

Naming,  structure,  and  constraint  conflicts  amongst  equiva- 
lent entities  is  resolved  by  suitably  renaming,  and  combining  attribute  sets  to 
form  consolidated  global  schema  entities.  Suitable  look-up  tables  are  included 
for  the  global  data  definition  in  the  global  controller  to  map  between  these 
global  schema  names  and  the  existing  names  utilized  at  the  component 
database  level.  Analyzing  the  original  semantic  intention  of  the  users  might 
be  required  to  resolve  some  entity  constraint  conflicts. 
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(1)  Naming  Conflicts.  An  example  of  a  synonym  problem  is 
the  entity  COMMISSIONED.OFFICER,  in  the  OPINS  database,  contrasted  with 
the  equivalent  MEMBER-OFR  in  IMAPMIS.  Similarly,  IMAPMIS  names  a  given 
course  of  college  education  for  a  given  officer  EDUC,  but  OPINS  names  the 
same  entity  EDUCATION.  For  obviously  equivalent  entities  such  as  these,  the 
more  fully  detailed  name  should  be  chosen.  Alternatively,  a  name  from  a 
standardized  data  definition  which  appropriately  describes  the  global  entity 
could  be  chosen. 

An  example  of  homonyms  is  apparent  when  the  UNIT 
entity  from  ADMI  is  compared  with  UNIT  in  OPINS.  In  the  AD  MI  database, 
UNIT  refers  to  an  instance  of  a  military  activity,  such  as  a  ship  or  squadron. 
In  OPINS,  however  the  same  name  is  given  to  an  entity  which  is  actually  an 
officer's  assignment  to  a  given  billet,  at  a  given  unit.  Homonym  conflicts  such 
as  this  usually  arise  because  of  inadequate  specificity  of  the  naming  conven- 
tions employed.  In  this  case,  the  UNIT  entity  in  OPINS  should  be  completely 
renamed  as  DUTY_STATION_BILLET_ASSIGNMENT  to  better  reflect  its 
intended  meaning,  with  only  those  attributes  which  relate  to  an  officers 
assignment  to  particular  billets,  current  and  historical.  Remaining  attributes 
of  the  OPINS  UNIT  entity  which  deal  with  the  specific  unit  currently  assigned 
will  be  included  with  the  global  unit  entity. 

It  is  appropriate  here  to  mention  the  concept  of  organiza- 
tionally standardized  Fully  Qualified  Naming  (FQN),  on  which  much  effort 
has  been  expended  recently.  FQN  certainly  reduces,  and  seeks  to  completely 
avoid,  semantic  conflict  between  data  element  names,  and  applies  equally  to 
entities  and  attributes.  FQN  specifies  the  semantic  meaning  of  a  data  element 
in  sufficient  detail  that  confusion  between  merely  similar  elements  is 
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eliminated.  Applied  to  the  homonym  example  above,  FQN  would  result  in  a 
name  such  as  that  suggested  for  OPINS,  which  more  accurately  indicates  the 
semantic  function  which  that  entity  fulfills  (a  record  of  an  officer's  billet 
assignments,  and  not  simply  information  about  the  unit  currently  assigned 
to).  Similarly,  FQN  for  ADMI  would  result  in  a  name  closer  to  CURRENT. 
UNIT.ASSIGNMENT.  This  is  a  very  over  simplified  treatment  of  the  theory  of 
Fully  Qualified  Names,  and  is  included  only  to  illustrate  the  current  thrust  of 
standardization  efforts  and  of  conventions  and  procedures  available  for 
resolution  of  this  kind  of  conflict.  Whatever  approach  is  taken,  the  global 
controller's  comprehensive  definition  includes  mapping  tables  to  allow 
decomposition  of  queries  against  global  schema  names  back  to  the  component 
databases. 

(2)  Entity  Structure  Conflicts.  An  example  of  missing  attri- 
butes is  found  in  UNIT  which  in  the  OPINS  database  does  not  include  an 
attribute  for  the  unit  Zip  code,  while  UNIT  in  the  ADMI  does.  Likewise, 
LANGUAGE  in  the  ADMI  has  an  attribute  for  listening  proficiency,  while 
LANGUAGE  in  OPINS  contains  an  attribute  for  writing  skill.  The  resolution  of 
missing  attribute  conflicts  is  simple.  The  union  of  attribute  sets  is  taken  for 
equivalent  entities,  which  ensures  that  all  originally  available  data  is 
included  in  the  global  schema.  In  resolving  one  conflict,  however,  the  intro- 
duction of  new  conflicts  should  be  avoided.  This  possibility  is  exemplified  by 
the  technique  of  taking  unions  of  different  attribute  sets,  which  solves  miss- 
ing attribute  problems,  but  which  may  raise  new  overlapping  conflicts. 

Overlapping  attributes  are  found  in  MEMBER-OFR 
(IMAPMIS),  which  does  not  contain  the  member's  name,  contrasted  with 
COMMISSIONED_OFFICER  (OPINS),  which  does.  This  is  due  to  the 
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Generalization/Specialization  relationship  of  MEMBER-OFR  to  MEMBER  in 
IMAPMIS.  The  member's  name  is  represented  by  an  attribute  of  MEMBER. 
This  is  solved  by  decomposing  COMMISSIONED.OFFICER  into  a  General- 
ization/Specialization structure,  segregating  the  appropriate  attributes  which 
apply  to  each  part  of  the  relationship.  Choosing  among  remaining  redundant, 
or  overlapping  attributes  after  this  entity  structure  conflict  is  resolved 
requires  more  analysis. 

Where  two  or  more  attributes  from  different  databases 
represent  truly  equivalent  data  elements,  the  attribute  with  the  most  fully 
detailed  name,  definition,  and  accuracy,  or  a  standardized  data  element,  if 
available,  should  be  chosen,  and  the  redundant  attributes  excluded  from  the 
global  schema.  Returning  to  the  example  of  overlapping  attributes  above,  the 
NAME  attribute  from  IMAPMIS  would  be  the  choice,  since  its  specification  is 
more  semantically  rich  than  either  of  the  other  two  name  attributes.  The 
global  controller  needs  in  its  detailed  definition  the  appropriate  look-up  tables 
to  match  the  chosen  global  entity  to  the  corresponding  attributes  in  the  com- 
ponent databases.  In  this  case,  the  more  fully  detailed  choice  is  intuitive  to 
the  user,  since  the  detailed  definition  of  the  IMAPMIS  NAME  attribute  sub- 
sumes the  definitions  of  the  other  two  with  no  loss  of  meaning. 

(3)  Entity  Constraint  Conflicts.  When  the  cardinality  of  rela- 
tionship between  two  entities  varies  across  two  or  more  schemas,  it  is  termed 
an  entity  constraint  conflict.  This  is  shown  by  the  one-to-one  relation  between 
UNIT  and  ACTIVE_DUTY_MnJTARY_MEMBER  in  the  ADMI,  as  opposed  to  the 
one-to-many  relation  between  UNIT  and  COMMISSIONED.OFFICER  in  the 
OPINS.  As  mentioned,  this  conflict  results  from  the  fact  that  UNIT  in  the 
OPINS  database  does  not  represent  an  equivalent  entity  to  UNIT  in  the  ADMI 
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database.  This  is  an  example  of  interdependency  of  conflicts,  where  one  type 
of  conflict  causes  another  conflict  of  a  different  type.  In  this  case,  solving  one 
(renaming  the  OPINS  UNIT)  will  also  resolve  the  other.  But  as  seen  in  the  case 
of  missing  attributes  under  entity  structure,  an  uncritical,  isolated  approach 
to  resolution  of  semantic  conflicts  can  become  a  circular  problem.  The  resolu- 
tion of  one  type  results  in  new  instances  of  a  different  type  of  conflict.  For  a 
true  cardinality  or  participation  constraint  conflict,  the  re-engineering  ana- 
lyst needs  to  use  the  constraint  that  reflects  the  actual  semantics  of  the 
application  area  of  interest.  Further  research  into  this  area  of  resolution 
strategy  is  suggested. 

b.      Attribute  Level  Conflicts 

Resolution  of  attribute  level  conflict  covers  the  same  concep- 
tual range  as  the  entity  level.  Appropriate  renaming,  and  inclusion/ 
elimination  of  missing  or  overlapping  attributes  can  successfully  deal  with 
naming  and  structure  conflicts. 

(1)  Attribute  Name  Conflicts.  Like  entity  name  conflicts, 
these  comprise  synonyms  and  homonyms.  Samples  of  attribute  synonyms 
from  the  databases  of  interest  are  DESIGNATOR  (OPINS)  and  DESIG 
(IMAPMIS),  as  well  as  ORIGINAL_SOURCE_CODE  (OPINS)  and  SOURCE.OF. 
COMMISSION  (ADMI).  These  both  illustrate  identical  real  world  facts  called  by 
different  names. 

An  examples  of  homonyms  is  UNIT_IDENTIFICATION_ 
CODE  (ADMI)  contrasted  to  UNIT_DDENTD7ICATION_CODE  (OPINS).  These  two 
identically  named  attributes  represent  different  real  world  facts.  The  ADMI 
defines  the  unit  identification  code  as  an  8-character  code  which  captures 
Department  of  Defense  wide  unit  identification,  while  the  same  attribute  in 
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the  OPINS  is  defined  as  a  standard  5-character  Navy  Unit  Identification  Code 
(UIC).  The  global  controller  will  have  look-up  tables  to  allow  mapping  between 
global  and  component  attributes  in  an  identical  manner  to  that  discussed  in 
the  above  section  on  entity  conflicts. 

(2)  Attribute  Structure  Conflict.  This  is  illustrated  by  RACE_ 
ETHNIC  in  ADMI,  and  the  two  attributes  RACE  and  ETHNIC  in  the  OPINS.  This 
case  demonstrates  a  single  attribute  to  multiple  attribute  structure  conflict. 
Alternatively,  the  real  world  value  of  an  officer's  warfare  designator  is 
represented  by  the  atomic  attribute  DESIG  in  the  OPINS,  while  the  IMAPMIS 
database  breaks  this  information  down  into  DESIG-CAT  and  DESIG-STAT, 
which  themselves  are  part  of  the  composite  attribute  DESIG.  The  suggested 
resolution  strategy  for  attribute  structure  conflict  is  to  capture  the  available 
information  at  the  finest  granularity  (i.e.,  using  the  largest  number  of 
attributes).  If  RACE  and  ETHNIC  contain  the  same  data  as  RACEJETHNIC, 
then  the  global  controller  will  decompose  that  query  into  the  two  atomic 
attributes.  The  same  holds  true  for  the  designator  information.  In  this  way, 
no  data  is  lost,  and  the  additional  flexibility  to  manipulate  the  available 
information  in  useful  ways  is  gained  over  using  the  single  combined  attribute. 

(3)  Attribute  Constraint  Conflicts.  An  example  of  type  clash 
is  give  by  SOCIAL_SECURITY_NUMBER  (ADMI),  which  is  defined  as  a  4-byte 
packed  integer,  while  the  identical  information  is  defined  as  a  9-character 
string  for  SSN  (OPINS).  The  two  Year/Month/Day  attributes  DATE_OF_BmTH 
(ADMI)  and  DOB  (IMAPMIS)  are  similarly  mismatched,  as  the  first  is  stored  as 
a  3-byte  packed  integer,  and  the  second  as  a  6-character  string. 

Allowable  value,  or  range,  clash,  is  also  illustrated  by  the 
two  date  attributes  just  noted.  In  the  IMAPMIS,  the  member's  date  of  birth  is 
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defined  as  having  a  value  between  January  1,  1900  and  December  31,  1999. 
An  incompatible  range  is  defined  for  the  ADMI,  since  the  date  of  birth  in  this 
database  can  take  on  any  6-digit  value  which  corresponds  to  a  valid  date  (in 
other  words,  the  date  is  only  constrained  to  be  a  date,  and  could  represent  a 
value  outside  that  allowed  for  the  same  date  in  the  IMAPMIS). 

To  resolve  both  type  and  range  clashes,  the  global  schema 
attribute  is  redefined  to  subsume  the  definitions  of  the  conflicting  attributes. 
This  strategy  is  a  very  rough  rule  of  thumb  at  best,  since  it  invites  instances 
of  inconsistent  data,  discussed  below.  The  global  controller  will  have  to  per- 
form the  translation  and  comparison  functions  described  below  to  deal  with 
the  potential  inconsistency. 

c.      Entity  Attribute  Conflicts 

An  example  of  this  is  the  member's  security  clearance  informa- 
tion, which  in  the  OPINS  and  IMAPMIS  databases  is  represented  as  separate 
entities;  SCRTY-CLEAR  in  the  IMAPMIS,  and  SECURITY.CLEARANCE  in  the 
OPINS.  The  equivalent  real  world  information  (though  less  detailed)  is  stored 
by  the  ADMI  as  the  composite  attribute  SECURITY.INVESTIGATION,  and  the 
atomic  attribute  SECURITY.CLASSIFICATION,  both  belonging  to  the  ACTIVE. 
DUTY_MILITARY_MEMBER  entity.  Resolution  of  this  type  of  semantic  conflict 
proceeds  by  removing  the  appropriate  attributes  from  the  entity  they  describe 
in  the  separate  database,  and  migrating  them  to  the  separate  entity  in  the 
global  schema.  (This  approach  assumes  that  the  global  schema  will  always 
represent  at  a  minimum  the  sum  of  independent  entities  from  the  separate 
databases,  taking  equivalence  mapping  into  account.  The  global  controller 
knows  where  to  find  the  equivalent  information  among  the  component 
databases,  even  when  the  individual  schemas  present  that  information  at 
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different  conceptual  levels  of  organization.  Thus  there  would  be  no  case  in 

which  the  attributes  of  an  existing  entity  would  be  migrated  to  a  higher  order 

entity  in  the  global  schema.  This  is  consistent  with  the  basic  philosophy  of 

representing  data  in  the  global  schema  at  the  finest  possible  granularity.) 

d.     An  Integrated  Global  EER  Schema  For  Three  Personnel 
Databases 

Applying  the  heuristics  and  suggested  resolution  strategies 
listed  above  results  in  a  global  EER  schema  for  the  Active  Duty  Military 
Inventory,  Officer  Personnel  Information  System,  and  Inactive  Manpower 
And  Personnel  Management  Information  System  databases.  This  schema  can 
then  be  used  to  guide  the  formulation  of  queries  against  the  total  original 
volume  of  data  available  across  all  three  databases.  Figure  20  shows  the 
completed  global  EER  schema. 
2.      Data  Level  Conflicts 

Data  level  conflicts,  which  include  inconsistencies,  and  data  repre- 
sentation conflicts,  present  a  much  more  difficult  resolution  problem.  Often 
the  only  choice  is  to  go  back  to  the  user,  or  recapture  the  original  data  from 
domain  of  interest.  These  conflicts  only  arise  when  data  is  returned  from  a 
query  against  the  federated  database.  The  global  controller  must  be  imple- 
mented with  a  capacity  to  deal  with  the  extraction,  conversion,  comparison, 
and  resolution  of  these  data  level  conflicts.  The  following  heuristics  can  be 
applied  to  the  design  of  the  global  controller,  but  with  the  understanding  that 
they  are  by  no  means  assured  of  correct  results. 
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Figure  20.  Integrated  Global  EER  Schema  for  ADMI,  OPINS  and  IMAPMIS 
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a.  Inconsistencies 

A  simple,  and  obvious  example  of  an  inconsistency  is  the  ADMI 
database  returning  a  PAY_GRADE  of  4,  corresponding  to  04,  or  Lieutenant 
Commander,  for  a  given  commissioned  officer,  while  the  OPINS  returns  a 
value  of  3  for  the  attribute  PRESENT_GRADE,  indicating  a  rank  of  Lieutenant. 
A  heuristic  would  be  to  accept  the  data  from  the  database  with  the  most 
recent  update.  This  will  not  guarantee  accuracy,  but  offers  a  simple  and  low 
effort  approach.  Alternatively,  other  data  might  be  available  to  cross-verify 
and  resolve  the  conflict  (i.e.,  pay  information  might  correspond  to  one  rank 
and  not  another,  or  the  DOR  could  be  checked  against  years  of  commissioned 
service,  to  see  if  one  rank  was  irrational).  An  important  qualification  of  this 
second  alternative  is  that  it  would  be  processing  based,  as  opposed  to  a 
schematic  resolution.  This  processing  would  be  included  in  the  detailed 
implementation  of  the  global  controller. 

b.  Data  Representation  Conflicts 

Dissimilar  expressions  can  often  only  be  resolved  by  accepting 
data  values  from  all  heterogeneous  databases  queried  by  the  global  schema, 
and  deciding  by  inspection  whether  the  information  is  equivalent,  and  which 
value  to  accept.  Alternatively,  an  automatic  resolution  might  be  built  into  the 
global  controller.  Such  a  solution  would  have  to  depend  on  large  and  ineffi- 
cient look-up  tables  covering  literally  every  conceivable  expression  which 
could  represent  the  equivalent  information  of  interest.  This  is  because 
expression  conflicts  cover  such  a  broad  spectrum  of  possibilities,  and  can  arise 
when  no  other  classifiable  conflicts  are  known  or  expected.  Additionally,  such 
a  mapping  scheme  would  necessarily  be  dynamic,  since  each  new  user  verified 
instance  of  an  equivalent,  though  conflicting,  expression  would  have  to  be 
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included  for  future  reference.  Dissimilar  units,  and  dissimilar  precisions 
admit  to  some  general  rules  of  thumb  for  resolution  which  are  noted  below. 

(1)  Dissimilar  Expressions.  An  example  of  this  is  ACTIVITY. 
TITLE,  a  character  attribute  which  in  OPINS  represents  the  UNTT's  text  name, 
such  as  COMSURFRON  THREE'.  Contrast  this  to  the  attribute  ACTY-LANG- 
NAME,  a  character  attribute,  which  IMAPMIS  uses  for  the  same  information. 
The  actual  string  stored  in  this  attribute  for  the  equivalent  unit  might  be 
'Cmdr,  Sfc  Sqdrn  3'.  Bearing  in  mind  that  the  more  richly  denned  attribute 
was  suggested  above  for  inclusion  in  the  global  schema  (in  this  case  ACTY- 
LANG-NAME)  the  expression  conflict  would  arise  when  the  value  from  OPINS 
was  returned  and  clashed  with  that  from  IMAPMIS.  Further  research  is 
required  to  resolve  this  kind  of  conflict  short  of  post  query  inspection  and 
addition  of  verified  equivalent  representations  to  the  look-up  table,  since  it  is 
a  result  of  purely  subjective  choice  as  to  appropriate  content. 

(2)  Dissimilar  Units.  This  is  illustrated  by  the  UNIV-DUR 
attribute,  in  the  IMAPMIS  database,  which  represents  a  2-character  value  for 
the  length  of  an  officer's  course  of  instruction  in  weeks.  On  the  other  hand, 
the  DURATION  attribute  in  the  OPINS  is  also  two  characters,  but  represents 
the  length  of  a  course  of  instruction  in  months.  This  kind  of  data  conflict  is 
amenable  to  the  FQN  approach  mentioned  above,  since  one  would  be  repre- 
sented as  UNTV-DUR-IN-WEEKS,  with  the  other  as  DURATION_IN_MONTHS.  To 
resolve  this  conflict  in  the  context  of  the  global  controller  for  a  federated 
database,  each  value  would  be  retrieved,  based  on  a  (possibly)  user-defined 
query  in  a  given  unit.  The  controller  would  accept  both  values,  translate  them 
into  a  common  unit,  compare  them  for  consistency,  and  return  the  informa- 
tion to  the  user  in  the  requested  units.  It  is  interesting  to  note  that  if  the 
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values  still  conflict  after  translation,  the  conflict  becomes  an  inconsistency, 
rather  than  a  dissimilar  units  conflict.  Additionally,  in  this  specific  case,  the 
only  time  an  inconsistency  will  arise  is  when  the  absolute  values  returned 
from  the  component  databases  match,  since  the  original  conflict  is  due  to 
their  difference  in  definition. 

(3)  Dissimilar  Precisions.  This  type  of  data  level  conflict  is 
shown  by  READING_PROFICIENCY  from  the  ADMI  database.  This  1-character 
attribute  is  constrained  to  the  numeral  values  of  zero  through  nine,  with  nine 
being  defined  as  excellent,  and  zero  as  unacceptable,  with  eight  gradations 
completing  the  allowable  values.  The  OPINS  definition  for  SKILL_READ,  how- 
ever, while  it  is  also  a  1-character  attribute,  groups  the  allowable  ten 
numeral  range  into  four  sub  ranges,  from  zero-one  meaning  poor,  to  eight- 
nine  meaning  outstanding.  In  this  case,  the  attribute  definition  with  the  finer 
granularity  should  be  chosen  for  the  global  schema  (to  capture  all  available 
information),  and  during  retrieval  the  less  precise  attribute  values  mapped 
onto  that  scale  by  means  of  a  look-up  table.  If  after  this  mapping,  the  values 
from  the  two  databases  still  do  not  agree,  the  conflict  devolves  to  an 
inconsistency,  as  noted  above. 


74 


VII.  CONCLUSIONS  AND  RECOMMENDATIONS 

A.  CONCLUSIONS 

Analysis  of  several  independently  developed  and  maintained  real  world 
databases  from  the  same  functional  area  shows  that  the  expected  heterogene- 
ity does  exist.  Three  levels  of  heterogeneity  can  be  recognized;  platform, 
DBMS,  and  semantic.  Of  these  three,  much  effort  has  gone  into  resolving  the 
technical  problems  of  making  a  global  query  against  databases  of  fundamen- 
tally different  organization.  Problems  such  as  formulating  a  relational  state- 
ment that  can  be  processed  by  a  CODASYL  based  DBMS  admit  to  technical 
solutions.  While  this  type  of  research  addresses  platform  and  DBMS  hetero- 
geneity, there  is  still  an  urgent  need  to  identify  and  resolve  semantic  con- 
flicts, or  differences  in  the  meaning  of  information  stored  in  existing  diverse 
databases. 

To  effectively  identify  and  classify  all  types  of  semantic  heterogeneity, 
data  organizations  must  be  expressed  in  a  common  schema.  The  Enhanced 
Entity  Relationship  model  is  an  appropriate  one  for  forming  an  integrating 
schema  of  heterogeneous  databases.  Because  it  is  semantically  rich,  and  has 
found  wide  use  in  initial  design  of  databases  (whatever  their  final  implemen- 
tation), it  is  a  useful  model  for  reverse  engineering  existing  applications  and 
transforming  them  into  equivalent  schemas. 

By  systematically  comparing  different  schemas  in  the  common  model, 
the  various  types  of  semantic  conflict  are  identifiable,  and  can  be  usefully 
grouped  in  a  framework.  A  large  part  of  the  semantic  conflicts  found  result 
from  arbitrary  and  undisciplined  application  of  naming  conventions  and  data 
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definitions  during  the  original  design.  This  framework  represents  a  powerful 
methodological  tool  for  the  analysis  of  any  set  of  heterogeneous  databases 
which  are  expressed  in  the  EER  model. 

The  major  weakness  noted  in  this  process  is  the  difficulty  of  correctly 
capturing  the  original  users'  intentions  regarding  relationship  constraints 
and  cardinalities.  It  is  felt  that  this  is  due  to  the  fact  that  although  relation- 
ship constraints  and  cardinalities  are  explicitly  represented  in  an  EER 
schema  (such  as  one  resulting  from  an  initial  top  down  design  effort),  these 
constraints  are  usually  enforced  at  the  implementation  level  through 
procedures  rather  than  being  captured  in  the  schema  itself.  It  is  unclear  that 
any  level  of  database  description  available  to  the  re-engineering  analyst, 
short  of  a  detailed  source  code  listing  of  the  actual  application,  will  allow  the 
original  relationship  constraints  to  be  conceptually  modeled  with  complete 
accuracy. 

The  process  of  exploring  possible  solutions  to  the  various  types  of  seman- 
tic conflict  reveals  that  a  wide  spectrum  of  techniques  apply.  Some  resolu- 
tions are  simple,  such  as  renaming  and  associated  look-up  tables,  and  provide 
certainty  of  a  correct  solution.  Other  types  of  semantic  conflict  are  extremely 
difficult  to  resolve,  particularly  data  inconsistencies.  While  recourse  to  the 
user,  or  re- examination  of  the  real  world  information,  will  certainly  deal  with 
these  problems,  a  more  complete  theoretical  approach  should  be  pursued. 

B.     RECOMMENDATIONS 

Current  Department  of  Defense  efforts  to  institute  Fully  Qualified 
Naming  (FQN)  principles  show  great  promise  for  eliminating  many  types  of 
semantic  conflict  identified  herein.  FQN  should  be  fully  enforced  for  all  new 
Department  of  Defense  database  applications. 
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FQN,  however,  will  primarily  benefit  newly  designed  databases.  There 
remains  a  need  for  an  integrating  model  to  support  the  integration  of  existing 
heterogeneous  databases  and  the  resolution  of  semantic  conflict.  This  inte- 
grating model  should  be  semantically  rich  enough  to  subsume  the  conceptual 
organizations  of  old  and  new  databases.  The  Department  of  Defense  should 
designate  a  suitable  conceptual  data  model  to  be  used  in  all  efforts  to  inte- 
grate existing  heterogeneous  databases,  and  develop  or  procure  the  support- 
ing tools  to  facilitate  integration  using  the  common  conceptual  model. 

C.  FUTURE  RESEARCH  EFFORTS 

FQN  will  not  solve  the  problems  of  semantic  heterogeneity  in  existing 
databases  (short  of  complete  redesign).  Therefore,  further  research  is  sug- 
gested in  the  area  of  general  solutions  to  resolving  the  types  of  semantic  con- 
flict identified  by  the  classification  framework.  In  particular,  interdependen- 
cies  of  conflicts,  some  of  which  were  noted  in  the  course  of  this  analysis, 
should  be  more  rigorously  investigated.  Efforts  to  resolve  semantic  conflict 
would  benefit  greatly  from  a  framework  similar  to  the  one  presented  here, 
which  could  enumerate  various  interdependences,  and  provide  assured  ways 
of  resolving  each,  without  introducing  new  conflicts. 

Additional  research  is  also  warranted  in  the  field  of  reverse  engineering 
and  the  development  of  conceptual  models  for  existing  implementations.  For 
example,  determining  relationship  constraints  and  cardinalities  from  existing 
specifications.  The  ability  to  accurately  capture  this  semantic  content  without 
recourse  to  a  detailed  analysis  of  DBMS  processing  algorithms  would  greatly 
enhance  the  usefulness  of  the  bottom-up  integration  strategy  suggested  by 
this  thesis. 
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APPENDIX  D 
ADMI  ENTITY  STRUCTURE 


Active  Duty  Master  Inventory  (ADMI) 

ACTIVE_DUTY_MILITARY_MEMBER  = 

SOCIAL_SECURITY_NUMBER  (key) 

+  NAME  (comp) 

+  DATE_OF_BIRTH  (comp) 

+  SEX 

+  RACE_ETHNIC 

+  ETHNIC.GROUP 

+  PAY_ENTRY_BASE_DATE  (comp) 

+  SERVICE 

+  MOS  (comp) 

+  DATE_OF_CURRENT_RANK  (comp) 

+  PAYGRADE 

+  SECURITY.CLASSIFICATION 

+  SECURITYJNVESTIGATION  (comp) 

+  EDUCATION  (comp) 


NAVAL_SERVICE_MEMBER  = 

+       NAVAL_SECURITY_INVESTIGATION_TYPE 
+       SERVICE  SPECIFIC 


NAVAL_OFFICER  = 

+       YEARS_OF_COMMISSIONED_SERVICE 
+       SOURCE  OF  COMMISSION 


UNIT  = 

UNIT_IDENTIFICATION_CODE  (key) 
+       DUTY.LOCATION 
+       UNIT_ZIP_CODE 
+       MAJOR_COMMAND_CODE 
+       PROGRAM_ELEMENT_CODE 
+       STRENGTH  ACCOUNTING  STATUS 
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LANGUAGE  = 

+  IDENTITY  (pkey) 

+  DATE.LASTTESTED  (comp) 

+  LISTENING.PROFICIENCY 

+  SPEAKING.PROFICIENCY 

+  READING.PROFICIENCY 

+  PROFICIENCY  SOURCE 
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APPENDIX  E 
OPINS  ENTITY  STRUCTURE 

Officer  Personnel  Information  System  (OPINS) 

COMMISSIONED.OFFICER  = 
SSN  (key) 

+  NAME  (comp) 

+  DATE_OF_BIRTH  (comp) 

+  SEX 

+  RACE 

+  ETHNIC 

+  PEBD  (comp) 

+  ACTIVE_COMMISSION_BASE_DATE  (comp) 

+  DESIGNATOR 

+  DOR  (comp) 

+  PRESENT.GRADE 

+  ORIGINAL  SOURCE  CODE 


UNIT  = 

UNIT_IDENTIFICATION_CODE  (comp)  (sub-attribute  is  key) 
+       HOMEPORT 
+       TYPE.ASSIGNMENT 
+       ACTIVITYTITLE 
+       BILLET_SEQUENCE_NUMBER  (key) 
+       DATE_ASSIGNED  (comp)  (key) 
+       FROM  (comp) 
+       TO  (comp) 


LANGUAGE.SKILL  = 
+       CODE  (pkey) 
+       PROFICIENCY_YEAR 
+       METHOD  (comp) 
+       SKILL  (comp) 


SECURITY.REQUIREMENT  = 
SECURITY.CODE  (pkey) 
+       SECURITY.AGENCY  ' 
+       SECURITYJNVESTIGATION.DATE  (comp) 
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YEAR.GROUP  = 

YEAR_GROUP_ID  (comp)  (KEY) 
+       PROJECTED_AUTHORIZED_PROMOTION_DATE  (comp) 
+       PROJECTED  AUTHORIZED  PROMOTION  GRADE 


EDUCATION  = 

COLLEGE_NAME  (pkey) 
+       YEAR_COMPLETED  (pkey) 
+       LEVEL 
+       DURATION 
+       MAJOR 
+       SPECIALTY 
+       SPONSOR 
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APPENDIX  F 
IMAPMIS  ENTITY  STRUCTURE 


Inactive  Manpower  And  Personnel  Managment  Information  Sysytem 
(IMAPMIS) 


MEMBER  = 

SSN  (key) 

+ 

MEMBER-NAME  (comp) 

+ 

DOB  (comp) 

+ 

SEXC 

+ 

RACE 

+ 

ETHN 

+ 

PEBD  (comp) 

+ 

OFCR-ENL-IND 

MEMBER-OFR  = 
+       GRD-CD 
+       SRCE-ORIG  (comp) 
+       DT-PRMTN  (comp) 
+       DESIG  (comp) 
+       BASE-DT-CMSN-SVC  (comp) 


EDUC 

+ 

UNIV-NAME  (pkey) 

+ 

UNIV-DT-CMP  (pkey) 

+ 

UNIV-SPNSR 

+ 

UNIV-DUR 

+ 

UNrV-LVL 

+ 

UNrV-MAJ 

+ 

UNrV-SPEC 

AATY 

+ 

ACTrV-UIC  (key) 

+ 

ACTY-LANG-NAME 

+ 

GEOGRAPHIC-LOC 

+ 

UNIT-ADRS  (comp) 

+ 

AATY-ATC 

+ 

PROG-ELEMENT-CD 
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SCRTY-CLEAR  = 

+       SCRTY-INVST-DT  (pkey)  (comp) 

+       SCRTY-AGCY 

+       SCRTY-INVST-TYPE 

+       SCRTY-CLR-AUTH 

+       SCRTY-CLR-AUTH-DT  (comp) 


LANG  = 

+       LANG-ID  (pkey) 

+       LANG-APRSL  (comp) 

+       LANG-METH-APRSL  (comp) 

+       LANG-DT-TESTED  (comp) 
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APPENDIX  G 
ADMI  DATA  DEFINITIONS 


Active  Duty  Master  Inventory  ADMI 

Attribute  Type  Length  Key 


Range 


ACTIVE  J)UTY_MILITARY_MEMBER  Entity 

SOCIAL_SECURITY_ 

NUMBER  NI  4  Y  0-9 

**Member's  Social  Security  Number  (in  4  byte  packed  numeric  format). 


NAME 
-LAST 
-FIRST 
-MIDDLE 


27 

15 

11 

1 


N 


A..Z 


**Member's  full  name  (in  Last,First,MI  {including  ","s)  format). 

DATE_OF_BIRTH  NI  3  N  0-9 

**Member's  date  of  birth  (in  YYMMDD  3  byte  packed  numeric  format). 


SEX  NI  1 

**Member's  sex  (l=Male,  2=Female). 

RACE.ETHNIC  C  1 

**Member's  Race  (Caucasian,  African,  etc.). 

ETHNIC.GROUP  C  1 

**Member's  ethnic  group  (special  code). 


N 

1,2 

N 

C,M,A,I,H 

N 

A-Z,  0-9 

PAY_ENTRY_ 
BASE  DATE 


NI 


N 


0-9 


Constraint 


Mandatory 


**Member's  start  date  for  calculation  of  time  in  service  for  pay  purposes  (in  YYMMDD  3  byte 
packed  numeric  format). 

SERVICE  NI  1  N  1,2,3,4  Mandatory 

**Member's  service  (l=Army,  2=Air  Force,  3=Navy,  4=Marines).  Defining  attribute  of 
SERVICE-MEMBER  Specialization. 


N 


A-Z,  0-9 


MOS  C  14 

-PRIMARY  7 

-DUTY  7 

**Member's  Military  Occupational  Specialty  (code,  both  the  MOS  gained  by  training 
{Primary},  and  for  the  current  assignment  (Duty)). 
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Attribute  Type  Length  Key  Range  Constraint 

DATE_OF_CURRENT_ 
RANK  NI  2  N  0-9 

**Member's  date  of  promotion  to  current  pay  grade  (in  YYMM  2  byte  packed  numeric 
format). 

PAYGRADE  NI  1  N  0-9 

**Member's  current  paygrade  (l=El/Ol,  2=E2/02,  etc.,  covering  grades  for  Naval  Officers 

from  Ensign  {01}  to  Admiral  {O10}). 

SECURITY. 
CLASSIFICATION  NI  1  N  0-9 

**Member's  security  clearance  (0=None,  l=Classified,  2=Secret,  3=Top  Secret,  etc.). 

SECURITY, 

INVESTIGATION  NI  3  N  0-9 

-TYPE  1 

-DATE_OF_ 
COMPLETION  2 

**Type  of  security  investigation  completed  for  member  (0=None,  l=National  Agency  Check, 
2=Background  Investigation,  3=Special  Background  Investigation,  etc.),  and  date  on  which  it 
was  completed  (in  YYMM  2  byte  packed  numeric  format). 

EDUCATION  NI     .  3  N  0-9 

-CODE  1 

-CERT  1 

-HIGHEST_YEAR  1 

**Member's  educational  data,  including  code  for  college  level  courses  (0=No,  l=Yes), 
certification  of  High  School  completion  (0=No,  l=Yes),  and  highest  year  of  schooling 
completed  (in  1  byte  packed  numeric  format). 


NAVAL_SERVICE_MEMBER  Entity 

NAVAL_SECURITY_ 

INVESTIGATION. 

TYPE  C  2  N  0-9 

**Special  security  investigation  information  required  for  Naval  service  members  (field  is  null 
for  other  services). 

SERVICE.SPECIFIC  C  2  N  0-9 

**Meaning  of  attribute  varies  according  to  member's  service. 


NAVAL.OFFICER  Entity 

YEARS_OF_ 

COMMISSIONED. 

SERVICE  NI  1  N  0-9 

**Officer's  total  years  of  commissioned  military  service  (in  1  byte  packed  numeric  format). 
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Attribute  Type  Length  Key  Range  Constraint 

SOURCE_OF_ 

COMMISSION  NI  1  N  0-9 

"""Officer's  commissioning  source  (0=Service  Academy,  l=ROTC,  3=Officer  Candidate  School, 
etc.). 


UNIT  Entity 
UNIT. 

IDENTIFICATION, 

CODE  C  8  Y  A-Z,  0-9  Mandatory 

♦♦Department  of  Defense  8  digit  Unit  Identification  Code  (includes  3  digit  service/component 
identification,  plus  5  digit  std  Navy  UIC) 


DUTY_LOCATION  NI  1  N  0-9 

**Unit's  geographic  location  (0=Continental  US,  l=Europe,  2=Japan,  3=Middle  East,  etc.). 

UNIT_ZIP_CODE  C  5  N  0-9 

**Unit's  5  digit  postal  Zip  code. 

MAJOR. 

COMMAND.CODE  NI  3  N  0-9  Mandatory 

**Unit's  assignment  to  major  force  command  (121=CINCPACFLT,  333=USAREUR, 
542=Sixth  FLT,  etc.). 

PROGRAM. 

ELEMENT.CODE  C  6  N  A-Z,  0-9  Mandatory 

**Unit's  budgetary  funding  program  element  code. 

STRENGTH. 

ACCOUNTING. 

STATUS  NI  1  N  0,1  Mandatory 

**Unit's  is  required  to  continuously  report  total  percentage  of  authorized  end  stregth  (0=No, 
l=Yes). 


LANGUAGE  Entity 

IDENTITY  CI  P  A-Z  Mandatory 

**Foreign  language  (F=French,  R=Russian,  M=Mandarin,  A=Arabic,  etc.). 

DATE.LAST. 

TESTED  NI  2  N  0-9 

**Date  on  which  the  language  proficiency  was  last  tested  (in  YYMM  2  byte  packed  numeric 
format). 

LISTENING. 

PROFICIENCY  NI  1  N  0-9 

**Level  of  apptitude  in  listening  comprehension  for  a  foreign  language  (0=Unacceptable, 
l=Very  Poor,  2=Poor,  3=Below  Average,  4=Average,  5=Above  Average,  6=Good,  7=Very  Good, 
8=Excellent,  9=Fluent). 
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READING. 

PROFICIENCY  NI  1  N  0-9 

**Level  of  apptitude  in  Reading  comprehension  for  a  foreign  language  (0=Unacceptable, 
l=Very  Poor,  2=Poor,  3=Below  Average,  4=Average,  5=Above  Average,  6=Good,  7=Very  Good, 
8=Excellent,  9=Fluent). 

SPEAKING. 

PROFICIENCY  NI  1  N  0-9 

**Level  of  apptitude  in  speaking  comprehension  for  a  foreign  language  (0=Unacceptable, 
l=Very  Poor,  2=Poor,  3=Below  Average,  4=Average,  5=Above  Average,  6=Good,  7=Very  Good, 
8=Excellent,  9=Fluent). 

PROFICIENCY. 

SOURCE  NI  1  N  0-9 

**Source  of  the  proficiency  ratings  for  Listening,  Speaking  and  Reading  (0=Assessment  by 
supervisor  on  duty,  l=Local  Test,  2=Formal  language  school,  4=Defense  Language  Institues, 
etc.). 
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APPENDIX  H 
OPINS  DATA  DEFINITIONS 


Officer  Personnel  Information  System  OPINS 
Attribute  Type  Length  Key  Range 


COMMISSIONED.OFFICER  Entity 

SSN  NI  9 

**Member's  Social  Security  Number. 


NAME 
-LAST 
-FIRST 
-MIDDLE 


27 

16 

10 

1 


N 


0-9 


A..Z 


**Member's  full  name  (in  Last ,  First ,  MI  format,  including  spaces). 


DATE_OF_ 

BIRTH  NI  6  N 

♦♦Member's  date  of  birth  (in  YYMMDD  format). 

SEX  C  1  N 

♦♦Member's  sex. 


0-9 


M,F 


Constraint 


Mandatory 


RACE  CI  N  A-Z 

♦♦Member's  race  (C=Caucasian,  N=Negroid,  H=Hispanic,  etc.). 

ETHNIC  CI  N  A-Z 

♦♦Member's  ethnic  group  (arbitrary  code,  A=North  European,  B=Canadian,  C=East 
European,  etc.). 


PEBD  NI  6  N 

♦♦Member's  pay  entry  base  date  (in  YYMMDD  format). 


0-9 


ACTIVE. 

COMMISSION, 

BASE.DATE  NI  6  N  0-9 

♦♦Member's  starting  date  of  commissioned  service  (in  YYMMDD  format). 

DESIGNATOR  NI  4  N  0-9  Mandatory 

♦♦Officer's  warfare  designator  (1110=Active  Duty  Surface  Warfare,  1115=Reserve  Surface 
Warfare,  etc.). 


DOR  NI  6  N 

♦♦Member's  date  of  present  grade  (in  YYMMDD  format). 


0-9 
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PRESENT.GRADE  NI  1  N  0-9  Mandatory 

"""Officer's  current  paygrade  (1=01,  2=02,  etc.,  covering  grades  from  Ensign  {01}  to  Admiral 
(O10J). 

ORIGINAL. 

SOURCE_CODE  CI  N  A-Z 

**Officer's  original  commissioning  source  (A=Naval  Academy,  R=  Reserve  Officer  Training 
Corps,  0=Officer  Candidate  School,  etc.). 


UNIT  Entity 

UNIT_ 

IDENTIFICATION, 

CODE  NI  10  N  0-9  Mandatory 

-ACTUAL.UIC  5  K 

-PARENTJJIC  5  Y 

**Navy  5  digit  unit  identification  code  for  both  unit  assigned,  and  Immediate  Superior  in 
Command  (ISIC)  of  that  unit. 

HOMEPORT  C  6  N  A-Z 

**Plain  language  name  (or  abbreviation)  of  unit's  assigned  homeport. 

TYPE_ 
ASSIGNMENT  NI  1  N  0-9 

**Unit's  duty  type  assingment  (0=Sea,  l=Continental  US,  2=Overseas,  etc.). 

ACTIVITY_TITLE  C  16  N  A-Z,  0-9 

**Unit's  plain  language  title  (or  abbreviation). 

BILLET_SEQUENCE_ 

NUMBER  NI  5  K  0-9 

**Specific  duty  assignment  by  billet  number  (12345=Commanding  Officer,  67890=Executive 
Officer,  etc.). 

DATE_ 

ASSIGNED  NI  4  K  0-9 

**Date  assignment  was  made  to  the  specific  duty  billet  (in  YYMM  format). 

FROM  NI  4  N  0-9 

**Date  the  specific  duty  billet  the  specific  duty  billet  assignment  was  assumed  (in  YYMM 

format). 

TO  NI  4  N  0-9 

**Date  the  specific  duty  billet  assignment  was  vacated  (in  YYMM  format). 
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LANGUAGE.SKILL  Entity 

CODE  NI  2  P  0-9 

Mandatory 
**Foreign  language  (01=Spanish,  43=Prench,  ll=Russian,  24=Arabic  (Iraqi  dialect), 
52=Farsi,  etc.). 

PROFICIENCY. 
YEAR  NI  2  N  0-9 

**Year  in  which  proficiency  in  a  language  was  most  recently  tested  (in  YYMM  format). 

METHOD  C  4  N  0-9 

-COMP  1 

-READ  1 

-WRITE  1 

-SPEAK  1 

**Method  used  to  appraise  the  level  of  aptitude  in  comprehension,  reading,  speaking, 
writing,  and  speaking  a  language  (0=Assessment  by  supervisor  on  duty,  l=Local  Test, 
2=Formal  language  school,  4=Defense  Language  Institute,  etc.). 

SKILL  C  4  N  0-9  Mandatory 

-COMP  1 

-READ  1 

-WRITE  1 

-SPEAK  1 

**Level  of  apptitude  in  comprehension,  reading,  speaking,  writing,  and  speaking  a  language 
(0-l=Poor,  2-4=Average,  5-7=Good,  8-9=Outstanding). 


SECUIUTY.REQUIREMENT  Entity 
SECURITY. 

CODE  NI  1  P  0-9  Mandatory 

**Level  of  security  classification  for  which  investigation  requirements  have  been  completed 
(0=None,  l=Classified,  2=Secret,  3=Top  Secret,  etc.). 

SECURITY. 
AGENCY  C  6  N  A-Z 

**Agency  abbreviation  which  completed  security  investigation  NVINSV=Naval  Investigative 
Sevice,  DFINSV=Defense  Investigative  Sevice,  FDBUIN=Federal  Bureau  of  Investigation, 
CTINAY=Central  Intelligence  Agency,  etc.). 

SECURITY. 

INVESTIGATION. 

DATE  NI  6  N  0-9 

**Date  of  completion  of  security  investigation  (in  YYMMDD  format). 
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YEAR.GROUP  Entity 

YEAR_GROUP_ID  NI  3  Y  0-9  Mandatory 

-YEAR  2 

-SPLIT  1 

**Promotian  year  group  (in  YY  ,  plus  {0=No  split,  l=Split,  lower  half,  2=Split  upper  half} 
format). 

PROJECTED. 
AUTHORIZED- 
PROMOTION. 
DATE  NI  6  N  0-9 

**Prospective  date  of  promotion  to  next  higher  rank  for  memebers  of  the  year  group  (in 

YYMMDD  format). 

PROJECTED. 
AUTHORIZED- 
PROMOTION. 
GRADE  NI  1  N  0-9 

**Prospective  next  rank  of  memebers  of  the  year  group  (1=01,  2=02,  etc.,  covering  grades 

from  Ensign  {01)  to  Admiral  {OlO}). 


EDUCATION  Entity 

COLLEGE.NAME  C  10  P  A.Z  Mandatory 

**Educational  institution  name  (or  abbreviation) 

YEAR. 

COMPLETED  NI  2  P  0-9  Mandatory 

**Year  in  which  a  course  of  education  was  completed  (in  YY  format). 

LEVEL  CI  N  A-Z 

**Level  of  course  of  education  (U=Undergraduate,  G=Graduate,  P=Postgraduate). 

DURATION  NI  2  N  0-9 

**Duration  (in  months)  of  course  of  education. 

MAJOR  NI  2  N  0-9 

** Academic  major  (12=Oceanography,  43=Aeronautical  Engineering,  55=Electrical 

Engineering,  etc.). 

SPECIALTY  NI  2  N  0-9 

**Naval  warfare  specialty  associated  with  course  of  education  (24=Surface  Warfare, 

55=Antisubmarine  Warafre,  87=Anitair  Warfare,  etc.). 

SPONSOR  NI  1  N  0-9 

**Navy  organization  which  sponsored  course  of  education  (3=Op-03,  4=Op-04,  8=Op--8,  etc.). 
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APPENDIX  I 
IMAPMIS  DATA  DEFINITIONS 


Inactive  Manpower  And  Personnel  Managment  Information  Sysy tern 
IMAPMIS 


Attribute 


Type  Length 


MEMBER  Entity 

SSN  C  9 

♦♦Member's  Social  Security  Number. 


MEMBER-NAME 
-SURNAME 
-FIRST 
-MIDDLE 
-POSITION 


27 
13 
7 
5 
2 


Key 

Y 

N 


♦♦Member's  full  name  (includes  JR.,  SR,  2,  3,  etc). 

DOB  C  6 

**Member's  date  of  birth  (in  YYMMDD  format). 

SEXC  C  1 

** Member's  sex 

RACE  C  1 

♦♦Member's  Race  (Caucasian,  African,  etc.). 

ETHN  C  1 

♦♦Member's  ethnic  group  (code). 

PEBD  C  6 

♦♦Member's  Pay  Entry  Base  date  (in  YYMMDD  format). 

OFCR-ENL-IND  C  1  N 

♦♦Member's  Officer/Enlisted  status. 


Range 


0-9 


A..Z,  1-9 


0,E 


Constraint 


Mandatory 


N 

0-9 

btwn  01701/00 
and  01701/99 

N 

M/F 

N 

C,M,A,I,H 

N 

0-9 

N 

0-9 

btwn  01/01/00 

and  01/01/99 


Mandatory 


MEMBER-OFR  Entity 
GRD-CD  C 

♦♦Officer's  present  rank  (01-09). 


N 


0-9 


Mandatory 
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SRCE-ORIG  C  3  N  a..Z,  0-9 

-SRCE-CD  2 

-SRCE-STAT  1 

"""Officer's  original  commissioning  source  and  active/reserve  status  (Code). 

DT-PRMTN  C  6  N  0-9  btwn  01/01/00 

and  01/01/99 
**Member's  date  of  Rank  (in  YYMMDD  format). 

DESIG  C  10  N  0-9 

-DESIG-STAT  1 

-DESIG-CAT  3 

-DESIG-DT  6 

**Officer's  warfare  designator  (3  digit  specialty,  and  1  digit  active/reserve  ind)  and  date  of 
award  of  designator  (in  YYMMDD  format). 

BASE-DT-CMSN- 
SVC  C  6  N  0-9  btwn  01/01/00 

and  01/01/99 
**Member's  date  of  commencement  of  commission  service  (in  YYMMDD  format). 


EDUC  Entity 

UNIV-NAME  C  10  P  A.Z  Mandatory 

**Educational  institution  name  (or  abbreviation) 

UNIV-DT-CMP  C  2  P  0-9  Mandatory 

**Year  in  which  a  course  of  education  was  completed  (in  Yy  format). 

UNIV-SPNSR  C  1  N  A-Z,  0-9 

**Navy  organization  which  sponsored  course  of  education  (3=Op-03,  4=Op-04,  8=Op-8, 

C=Navy  Comptroller,  P=Bureau  of  Naval  Personnel,  R=Chief  of  Naval  Reserve  Force,  etc.). 

UNIV-DUR  C  2  N  0-9 

**  Duration  (in  weeks)  of  course  of  education. 

UNIV-LVL  C  1  N  U,G,P 

**Level  of  course  of  education  (U=Undergraduate,  G=Graduate,  P= Postgraduate). 

UNrV-MAJ  C  2  N  0-9 

**Academic  major  (OC=Oceanography,  AE=Aeronautical  Engineering,  EE=Electrical 

Engineering,  etc.). 

UNrV-SPEC  C  2  N  0-9 

**Naval  warfare  specialty  associated  with  course  of  education  (SW=Surface  Warfare, 

AS=Antisubmarine  Warafre,  AA=Anitair  Warfare,  etc.). 


125 


Attribute 


Type  Length 


Key 


Range  Constraint 


AATY  Entity 

ACTIV-UIC  C  5  Y 

**Naval  5  digit  Unit  Identification  Code 

ACTY-LANG-NAME  C  26  N 

**Unit's  plain  language  title. 

GEOGRAPHIC-LOC  C  8  N 

**Unit's  geographic  location  (code). 

UNIT-ADRS  C  59  N 

-UNIT-ADRS-STRT  30 

-UNIT-ADRS-CITY  18 

-UNIT-ADRS-STAT  2 

-UNIT-ADRS-ZIP  9 

**Unit's  full  mailing  address  (including  9  digit  Zip  code). 

N 


AATY-ATC  C  3 

**Unit's  area  type  code  (Overseas,  Conus,  etc.). 

PROG-ELEMENT-CD  C  8 

**Unit's  budgetary  funding  program  element  code. 


N 


0-9 


A-Z,  0-9 


0-9 


A-Z.  0-9 


0-9 


Mandatory 


SCRTY-CLEAR  Entity 

SCRTY-INVST-DT  C  6  P  0-9 

**Date  of  completion  of  security  investigation  (in  YYMMDD  format). 

N  0-9 


SCRTY-INVST-TYPE  C  1 

**Type  of  security  investigation  completed  (Code) 


SCRTY-AGCY  C  1  N 

**Agency  which  completed  security  investigation  (code). 


0-9 


btwn  01/01/00 
and  01/01/99 


SCRTY-CLR-AUTH  C  1  N  U,C,S,T 

**Level  of  security  classification  authorized  as  a  result  of  the  security  investigation. 


SCRTY-CLR- 
AUTH-DT 


''Date  on  which  security  classification  was  authorized  (in  YYMMDD  format). 


0-9  btwn  01/01/00 

and  01/01/99 


LANG  Entity 

LANG-ID  C  2  P  0-9  Mandatory 

**Foreign  language  (SP=Spanish,  FR=French,  RU=Russian,  IQ=Arabic  (Iraqi  dialect}, 
FA=Farsi,  etc.). 
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LANG-APRSL  C  4  N  0-9  Mandatory 

-LANG-COMP  1 

-LANG-READ  1 

-LANG-WRITE  1 

-LANG-SPEAK  1 

**Level  of  apptitude  in  comprehension,  reading,  speaking,  writing,  and  speaking  a  language 
(0-l=Poor,  2-4=Average,  5-7=Good,  8-9=Outstanding). 

LANG-METH-APRSL  C  4  N  0-9 

-LANG-METH-COMP  1 

-LANG-METH-READ  1 

-LANG-METH-WRITE  1 

-LANG-METH-SPEAK  1 

**Method  used  to  appraise  the  level  of  aptitude  in  comprehension,  reading,  speaking, 
writing,  and  speaking  a  language  (Code). 

LANG-DT-TESTED  C  6  N  0-9  btwn  01/01/00 

and  01/01/99 
**Date  on  which  language  aptitude  was  appraised  (in  YYMMDD  format). 
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